
[refactor] CogVideoX followups + tiled decoding support #9150

Merged

19 commits merged into main from cogvideox-followup on Aug 13, 2024

Conversation

@a-r-r-o-w (Member) commented Aug 11, 2024:

What does this PR do?

Code
import gc

import torch
from diffusers import CogVideoXPipeline, CogVideoXDDIMScheduler
from diffusers.utils import export_to_video


def reset_memory():
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_accumulated_memory_stats()
    torch.cuda.reset_peak_memory_stats()


def print_memory():
    memory = round(torch.cuda.memory_allocated() / 1024**3, 2)
    max_memory = round(torch.cuda.max_memory_allocated() / 1024**3, 2)
    max_reserved = round(torch.cuda.max_memory_reserved() / 1024**3, 2)
    print(f"{memory=} GB")
    print(f"{max_memory=} GB")
    print(f"{max_reserved=} GB")


prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
    "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
    "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
    "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
    "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
    "atmosphere of this unique musical performance."
)
pipe = CogVideoXPipeline.from_pretrained("/raid/aryan/CogVideoX-trial", torch_dtype=torch.float16)
pipe.scheduler = CogVideoXDDIMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")

pipe.enable_model_cpu_offload()

# Run 1: CPU offloading with normal VAE decoding
reset_memory()
video = pipe(prompt=prompt, num_frames=48, guidance_scale=6, num_inference_steps=50, generator=torch.Generator().manual_seed(42)).frames[0]
print_memory()
export_to_video(video, "output.mp4", fps=8)

# Run 2: CPU offloading with tiled VAE decoding
pipe.vae.enable_tiling()

reset_memory()
video = pipe(prompt=prompt, num_frames=48, guidance_scale=6, num_inference_steps=50, generator=torch.Generator().manual_seed(42)).frames[0]
print_memory()
export_to_video(video, "output_tiling.mp4", fps=8)

Memory usage:

Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00, 4.51it/s]
Loading pipeline components...:  40%|████      | 2/5 [00:00<00:00, 3.29it/s]
The config attributes {'mid_block_add_attention': True, 'sample_size': 256} were passed to AutoencoderKLCogVideoX, but are not expected and will be ignored. Please verify your config.json configuration file.
Loading pipeline components...: 100%|██████████| 5/5 [00:01<00:00, 4.41it/s]
100%|██████████| 50/50 [02:44<00:00, 3.28s/it]

# CPU offloading, normal VAE decoding
memory=0.01 GB
max_memory=12.39 GB
max_reserved=20.39 GB

100%|██████████| 50/50 [02:35<00:00, 3.11s/it]

# CPU offloading, tiled VAE decoding
memory=0.01 GB
max_memory=10.81 GB
max_reserved=10.83 GB
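For context on where the savings come from: tiled decoding splits the latents into overlapping spatial tiles, decodes them one at a time, and linearly blends the seams, so peak memory scales with the tile size rather than the full frame. A minimal sketch of the idea (tile/overlap sizes, function names, and the 8x spatial scale factor are illustrative assumptions, not the actual AutoencoderKLCogVideoX implementation):

import torch


def blend_v(a, b, overlap):
    # Fade the bottom rows of tile `a` into the top rows of tile `b`.
    overlap = min(a.shape[-2], b.shape[-2], overlap)
    w = torch.linspace(0, 1, overlap, device=b.device).view(1, 1, 1, -1, 1)
    b[:, :, :, :overlap, :] = a[:, :, :, -overlap:, :] * (1 - w) + b[:, :, :, :overlap, :] * w
    return b


def blend_h(a, b, overlap):
    # Fade the right columns of tile `a` into the left columns of tile `b`.
    overlap = min(a.shape[-1], b.shape[-1], overlap)
    w = torch.linspace(0, 1, overlap, device=b.device).view(1, 1, 1, 1, -1)
    b[..., :overlap] = a[..., -overlap:] * (1 - w) + b[..., :overlap] * w
    return b


@torch.no_grad()
def tiled_decode(vae, z, tile=64, overlap=16, scale=8):
    # z: latents of shape (B, C, F, H, W). Decoding one latent tile at a
    # time keeps peak memory proportional to the tile, not the full frame.
    stride = tile - overlap
    rows = []
    for i in range(0, z.shape[3], stride):
        row = []
        for j in range(0, z.shape[4], stride):
            row.append(vae.decode(z[:, :, :, i : i + tile, j : j + tile]).sample)
        rows.append(row)
    result_rows = []
    for i, row in enumerate(rows):
        result_row = []
        for j, t in enumerate(row):
            if i > 0:
                t = blend_v(rows[i - 1][j], t, overlap * scale)
            if j > 0:
                t = blend_h(row[j - 1], t, overlap * scale)
            result_row.append(t[:, :, :, : stride * scale, : stride * scale])
        result_rows.append(torch.cat(result_row, dim=4))
    return torch.cat(result_rows, dim=3)

The seam blending is what keeps tile boundaries from showing up as visible lines in the decoded frames.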

Results:

* Normal: output.webm
* Tiled: output_tiling.webm

Note that you will need to install accelerate from source (the main branch) for this to work and to reproduce the numbers above; an example install command follows. With the stable release of accelerate, you might see an additional 5-7 GB of memory usage.
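For reference, installing accelerate from source typically looks like this (standard upstream repository assumed):

pip install git+https://github.com/huggingface/accelerate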

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@DN6 @sayakpaul @zRzRzRzRzRzRzR

@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@a-r-r-o-w changed the title from "[refactor] CogVideoX followups + tiled Decoding support" to "[refactor] CogVideoX followups + tiled decoding support" on Aug 12, 2024.
@a-r-r-o-w (Member Author) commented:

Something interesting/fishy is going on with enable_model_cpu_offload. A run takes about 1 min 30 s with CPU offloading disabled but ~3 min with it enabled (about a 2x slowdown). I would assume that the transformer, once inside the denoise loop, is not moving between CPU and CUDA at every step. Any ideas why this might be happening, @sayakpaul?
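A minimal way to compare the two paths (a sketch reusing the pipe and prompt from the script above; numbers will vary by hardware):

import time

import torch


def timed_run(pipe, prompt):
    # Wall-clock time for one full generation; synchronize so GPU work is counted.
    torch.cuda.synchronize()
    start = time.perf_counter()
    pipe(prompt=prompt, num_frames=48, guidance_scale=6, num_inference_steps=50)
    torch.cuda.synchronize()
    return time.perf_counter() - start


pipe.to("cuda")                  # all models resident on the GPU
t_gpu = timed_run(pipe, prompt)

pipe.enable_model_cpu_offload()  # each model is moved to the GPU only when used
t_offload = timed_run(pipe, prompt)

print(f"{t_gpu=:.1f}s, {t_offload=:.1f}s")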

@sayakpaul requested a review from DN6 on August 12, 2024.
@sayakpaul (Member) left a comment:

Very nice! I have left some questions. LMK if they are unclear.

Additionally, let's include a note in the docs on the memory savings from tiling.

Three review threads on docs/source/en/api/pipelines/cogvideox.md (resolved).

    def _set_gradient_checkpointing(self, module, value=False):
        if isinstance(module, (CogVideoXEncoder3D, CogVideoXDecoder3D)):
            module.gradient_checkpointing = value

-   def clear_fake_context_parallel_cache(self):
+   def _clear_fake_context_parallel_cache(self):
Reviewer comment: Better!

One review thread on src/diffusers/pipelines/cogvideo/pipeline_cogvideox.py (resolved).
@zRzRzRzRzRzRzR (Contributor) commented:

> Something interesting/fishy is going on with enable_model_cpu_offload. It takes about 1 min 30 seconds when CPU offloading is disabled but ~3 minutes with it enabled (about a 2x slowdown). I assume that the transformer, once in the denoise loop, would not be moving from CPU to CUDA and back at every step. Any ideas why this might be happening @sayakpaul?

I used this method and my run also took about 90 seconds. I couldn't reproduce the issue you're describing, so I need to check further; this shouldn't be an issue.

@a-r-r-o-w marked this pull request as ready for review on August 12, 2024.
@a-r-r-o-w (Member Author) commented:

@sayakpaul I've added a few explanations here. Could you please review again?

@a-r-r-o-w (Member Author) commented:

I think it would be good to add dynamic positional embeddings as well, to test the generalization capabilities of CogVideoX and remove the 48-frame, 480-height, 720-width limit. I have a POC almost ready for this (a sketch of the idea follows). Should I push it here and share results in a while, or do it in a separate PR? It shouldn't break anything existing, IMO. @sayakpaul
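For reference, the idea is to compute the sincos embedding table on the fly for whatever (frames, height, width) patch grid is requested, instead of relying on a fixed-size table. A minimal sketch (function names, the channel split between axes, and the example grid are assumptions, not the branch's actual code):

import torch


def get_1d_sincos(dim, length):
    # Standard 1D sine-cosine embedding: returns (length, dim); dim must be even.
    pos = torch.arange(length, dtype=torch.float32)
    omega = 1.0 / (10000 ** (torch.arange(dim // 2, dtype=torch.float32) / (dim // 2)))
    out = pos[:, None] * omega[None, :]
    return torch.cat([torch.sin(out), torch.cos(out)], dim=1)


def get_3d_sincos(dim, frames, height, width):
    # Give 1/4 of the channels to time and split the rest between H and W.
    dim_t = dim // 4
    dim_s = dim - dim_t
    emb_t = get_1d_sincos(dim_t, frames)
    emb_h = get_1d_sincos(dim_s // 2, height)
    emb_w = get_1d_sincos(dim_s // 2, width)
    emb = torch.cat(
        [
            emb_t[:, None, None, :].expand(frames, height, width, dim_t),
            emb_h[None, :, None, :].expand(frames, height, width, dim_s // 2),
            emb_w[None, None, :, :].expand(frames, height, width, dim_s // 2),
        ],
        dim=-1,
    )
    return emb.reshape(frames * height * width, dim)  # one embedding per patch


pos_embed = get_3d_sincos(1920, 13, 30, 45)  # illustrative grid and dim, not exact model values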

@sayakpaul (Member) commented:

Let's do a separate PR.

@a-r-r-o-w (Member Author) commented:

I've pushed the code to https://github.com/huggingface/diffusers/tree/cogvideox-dynamic-pos-embeds for possible future reference. After further testing with more than 49 frames and at different resolutions, I don't think the results are convincing enough to support it, so it's best not to add it at the moment.

@DN6 (Collaborator) left a comment:

LGTM 👍🏽

@a-r-r-o-w (Member Author) commented Aug 13, 2024:

@sayakpaul, could you check the note about memory optimizations here? If it looks good, I think we can merge this.

cc @zRzRzRzRzRzRzR for visibility

Edit: By the way, accelerate must be installed from source to replicate the memory numbers here. Until the next accelerate release, should we add a note saying the same?

@sayakpaul (Member) left a comment:

LGTM, thanks for the memory optims. Sleek!

@a-r-r-o-w merged commit a85b34e into main on Aug 13, 2024 (18 checks passed).
@a-r-r-o-w deleted the cogvideox-followup branch on August 13, 2024.
yiyixuxu pushed a commit that referenced this pull request Aug 24, 2024
* refactor context parallel cache; update torch compile time benchmark

* add tiling support

* make style

* remove num_frames % 8 == 0 requirement

* update default num_frames to original value

* add explanations + refactor

* update torch compile example

* update docs

* update

* clean up if-statements

* address review comments

* add test for vae tiling

* update docs

* update docs

* update docstrings

* add modeling test for cogvideox transformer

* make style