
AnimateDiff prompt travel #9231

Merged
DN6 merged 22 commits into main from animatediff/freenoise-improvements on Aug 28, 2024
Conversation

a-r-r-o-w (Member) commented Aug 20, 2024

What does this PR do?

Adds support for prompt travel to AnimateDiff pipelines.

Examples

The following are some bare-minimum examples that demonstrate the expected usage of the new features. Note that for latent upscaling, we upscale the latents naively and do not use an upscaler model here (something the reader could explore as a more complex workflow). Combined with other pipelines and techniques, these can produce some really cool animations.

Text-to-Video Prompt Travel

animatediff_multiprompt_2.webm
Code
import torch
from diffusers import AutoencoderKL, AnimateDiffPipeline, LCMScheduler, MotionAdapter
from diffusers.utils import export_to_video

device = "cuda"
dtype = torch.float16

# Load pipeline
motion_adapter = MotionAdapter.from_pretrained("wangfuyun/AnimateLCM", torch_dtype=dtype)
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=dtype)

pipe = AnimateDiffPipeline.from_pretrained(
    "emilianJR/epiCRealism", motion_adapter=motion_adapter, vae=vae, torch_dtype=dtype
)
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config, beta_schedule="linear")

pipe.load_lora_weights(
    "wangfuyun/AnimateLCM", weight_name="AnimateLCM_sd15_t2v_lora.safetensors", adapter_name="lcm_lora"
)
pipe.set_adapters(["lcm_lora"], [0.8])

# Enable FreeNoise for long prompt generation
pipe.enable_free_noise(context_length=16, context_stride=4)
pipe.to(device)

# Can be a single prompt, or a dictionary with frame timesteps
prompt = {
    0: "A caterpillar on a leaf, high quality, photorealistic",
    40: "A caterpillar transforming into a cocoon, on a leaf, near flowers, photorealistic",
    80: "A cocoon on a leaf, flowers in the backgrond, photorealistic",
    120: "A cocoon maturing and a butterfly being born, flowers and leaves visible in the background, photorealistic",
    160: "A beautiful butterfly, vibrant colors, sitting on a leaf, flowers in the background, photorealistic",
    200: "A beautiful butterfly, flying away in a forest, photorealistic",
    240: "A cyberpunk butterfly, neon lights, glowing",
}
negative_prompt = "bad quality, worst quality, jpeg artifacts"
width = 512
height = 512
num_frames = 256
guidance_scale = 2.5
num_inference_steps = 10

# Run inference
output = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=width,
    height=height,
    num_frames=num_frames,
    guidance_scale=guidance_scale,
    num_inference_steps=num_inference_steps,
    generator=torch.Generator("cpu").manual_seed(0),
)

# Save video
frames = output.frames[0]
export_to_video(frames, "output.mp4", fps=16)
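As a conceptual aside, the dictionary form of prompt maps frame indices to keyframe prompts, and frames in between receive conditioning blended from the surrounding keyframes. The toy sketch below illustrates the linear-interpolation idea over already-encoded prompt embeddings; it is not the pipeline's internal code, and interpolate_prompt_embeds is a hypothetical helper.

import torch

def interpolate_prompt_embeds(keyframe_embeds: dict, num_frames: int) -> torch.Tensor:
    # Hypothetical illustration: blend prompt embeddings between user-provided keyframes
    # so that every frame index gets its own conditioning tensor.
    keyframes = sorted(keyframe_embeds)
    per_frame = []
    for f in range(num_frames):
        prev = max((k for k in keyframes if k <= f), default=keyframes[0])
        nxt = min((k for k in keyframes if k >= f), default=keyframes[-1])
        if prev == nxt:
            per_frame.append(keyframe_embeds[prev])
        else:
            t = (f - prev) / (nxt - prev)
            per_frame.append(torch.lerp(keyframe_embeds[prev], keyframe_embeds[nxt], t))
    return torch.stack(per_frame)  # (num_frames, seq_len, embed_dim)

# Example with dummy embeddings for keyframes 0, 40, and 80
dummy = {k: torch.randn(77, 768) for k in (0, 40, 80)}
embeds = interpolate_prompt_embeds(dummy, num_frames=81)
assert embeds.shape == (81, 77, 768)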

Text-to-Video Prompt Travel + Latent Upscale

normal: animatediff_multiprompt_1.webm
upscaled: animatediff_multiprompt_1_latent_upscaled.webm
Code
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, AnimateDiffPipeline, AnimateDiffVideoToVideoPipeline, LCMScheduler, MotionAdapter
from diffusers.utils import export_to_video

device = "cuda"
dtype = torch.float16

# Load models and pipeline
motion_adapter = MotionAdapter.from_pretrained("wangfuyun/AnimateLCM", torch_dtype=dtype)
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=dtype)

pipe_txt2vid = AnimateDiffPipeline.from_pretrained(
    "stablediffusionapi/darksushimixv225", motion_adapter=motion_adapter, vae=vae, torch_dtype=dtype
)
pipe_txt2vid.scheduler = LCMScheduler.from_config(pipe_txt2vid.scheduler.config, beta_schedule="linear")

pipe_txt2vid.load_lora_weights(
    "wangfuyun/AnimateLCM", weight_name="AnimateLCM_sd15_t2v_lora.safetensors", adapter_name="lcm_lora"
)
pipe_txt2vid.set_adapters(["lcm_lora"], [0.8])

# Enable memory optimizations
# TODO: This might change in the future as the PR is not finalized
context_length = 16
context_stride = 4
pipe_txt2vid.enable_free_noise(context_length=context_length, context_stride=context_stride)
pipe_txt2vid.unet.enable_attn_chunking(context_length)  # Temporal chunking across batch_size x num_frames
pipe_txt2vid.unet.enable_motion_module_chunking(
    (512 // 8 // 4) ** 2
)  # Spatial chunking across batch_size x latent height x latent width
pipe_txt2vid.unet.enable_resnet_chunking(context_length)
pipe_txt2vid.unet.enable_forward_chunking(context_length)
pipe_txt2vid.to(device)

pipe_vid2vid = AnimateDiffVideoToVideoPipeline(
    vae=vae,
    text_encoder=pipe_txt2vid.text_encoder,
    tokenizer=pipe_txt2vid.tokenizer,
    unet=pipe_txt2vid.unet,
    motion_adapter=motion_adapter,
    scheduler=pipe_txt2vid.scheduler,
    feature_extractor=pipe_txt2vid.feature_extractor,
    image_encoder=pipe_txt2vid.image_encoder,
)

# Enable FreeNoise for long context generation
pipe_vid2vid.enable_free_noise(context_length=16, context_stride=4)
pipe_vid2vid.to(device, dtype=dtype)

# Can be a single prompt, or a dictionary with frame timesteps
prompt = {
    0: "a woman on a winter day, sparkly leaves in the background, snow flakes, close up",
    80: "a woman on a summer day, trees visible in the background, close up",
    160: "a woman on a autumn day, yellow leaves in the background, close up",
    240: "a woman on a rainy day, tropical leaves in the background, close up",
}
negative_prompt = "bad quality, worst quality"
width = 512
height = 512
num_frames = 256
guidance_scale = 2.5
num_inference_steps = 10

# Run inference to get latents that will be upscaled
latents = pipe_txt2vid(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=width,
    height=height,
    num_frames=num_frames,
    guidance_scale=guidance_scale,
    num_inference_steps=num_inference_steps,
    generator=torch.Generator("cpu").manual_seed(0),
    output_type="latent",
).frames

# Run latent upscaling
# Note that only naive upscaling is done here. Alternatively, a latent upscaler
# model could be used
batch_size, num_channels, num_frames, latent_height, latent_width = latents.shape
scale_factor = 1.5
scale_method = "nearest-exact"
upscaled_height = int(height * scale_factor)
upscaled_width = int(width * scale_factor)
upscaled_latent_height = int(latent_height * scale_factor)
upscaled_latent_width = int(latent_width * scale_factor)
strength = 0.6

upscaled_latents = []
for i in range(batch_size):
    latent = F.interpolate(latents[i], size=(upscaled_latent_height, upscaled_latent_width), mode=scale_method)
    upscaled_latents.append(latent.unsqueeze(0))
upscaled_latents = torch.cat(upscaled_latents, dim=0)

# Run inference for denoising upscaled latents
output = pipe_vid2vid(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=upscaled_width,
    height=upscaled_height,
    guidance_scale=guidance_scale,
    num_inference_steps=num_inference_steps,
    enforce_inference_steps=True,
    generator=torch.Generator("cpu").manual_seed(0),
    output_type="pil",
    latents=upscaled_latents,
    strength=strength,
)

# Save video
frames = output.frames[0]
export_to_video(frames, "output.mp4", fps=16)

Image-to-Video Prompt Travel

animatediff_ipadapter_multiimage.webm
Code
import torch
from diffusers import AutoencoderKL, AnimateDiffPipeline, LCMScheduler, MotionAdapter
from diffusers.utils import export_to_video, load_image

device = "cuda"
dtype = torch.float16

# Load models and pipeline
motion_adapter = MotionAdapter.from_pretrained("wangfuyun/AnimateLCM", torch_dtype=dtype)
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=dtype)

pipe = AnimateDiffPipeline.from_pretrained(
    "emilianJR/epiCRealism", motion_adapter=motion_adapter, vae=vae, torch_dtype=dtype
)
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config, beta_schedule="linear")

# https://huggingface.co/docs/diffusers/en/using-diffusers/ip_adapter#style--layout-control
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter-plus_sd15.bin")
pipe.set_ip_adapter_scale(1.0)

# Enable FreeNoise for long context generation
pipe.enable_free_noise(context_length=16, context_stride=4)
pipe.load_lora_weights(
    "wangfuyun/AnimateLCM", weight_name="AnimateLCM_sd15_t2v_lora.safetensors", adapter_name="lcm_lora"
)
pipe.set_adapters(["lcm_lora"], [0.9])
pipe = pipe.to(device)

prompt = "A strong man standing in rain, cyberpunk aesthetic, futuristic, bright background"
negative_prompt = "low quality, worst quality, jpeg artifacts"
width = 512
height = 640
num_frames = 32
ip_adapter_image1 = load_image("inputs/man1.png")
ip_adapter_image2 = load_image("inputs/man2.png")

# Run inference
output = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=width,
    height=height,
    ip_adapter_image=[[ip_adapter_image1, ip_adapter_image2]],
    num_frames=num_frames,
    guidance_scale=2.5,
    num_inference_steps=10,
    generator=torch.Generator("cpu").manual_seed(0),
)

# Save video
frames = output.frames[0]
export_to_video(frames, "output.mp4", fps=8)

Image-to-Video Prompt Travel + Latent Upscale

animatediff_ipadapter_multiimage_latent_upscaled.webm
Code
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, AnimateDiffPipeline, AnimateDiffVideoToVideoPipeline, LCMScheduler, MotionAdapter
from diffusers.utils import export_to_video, load_image

device = "cuda"
dtype = torch.float16

# Load models and pipeline
motion_adapter = MotionAdapter.from_pretrained("wangfuyun/AnimateLCM", torch_dtype=dtype)
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=dtype)

pipe_txt2vid = AnimateDiffPipeline.from_pretrained(
    "emilianJR/epiCRealism", motion_adapter=motion_adapter, vae=vae, torch_dtype=dtype
)
pipe_txt2vid.scheduler = LCMScheduler.from_config(pipe_txt2vid.scheduler.config, beta_schedule="linear")

# https://huggingface.co/docs/diffusers/en/using-diffusers/ip_adapter#style--layout-control
pipe_txt2vid.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter-plus_sd15.bin")
pipe_txt2vid.set_ip_adapter_scale(1.0)

# Enable FreeNoise for long context generation
pipe_txt2vid.enable_free_noise(context_length=16, context_stride=4)
pipe_txt2vid.load_lora_weights(
    "wangfuyun/AnimateLCM", weight_name="AnimateLCM_sd15_t2v_lora.safetensors", adapter_name="lcm_lora"
)
pipe_txt2vid.set_adapters(["lcm_lora"], [0.9])
pipe_txt2vid = pipe_txt2vid.to(device)

pipe_vid2vid = AnimateDiffVideoToVideoPipeline(
    vae=vae,
    text_encoder=pipe_txt2vid.text_encoder,
    tokenizer=pipe_txt2vid.tokenizer,
    unet=pipe_txt2vid.unet,
    motion_adapter=motion_adapter,
    scheduler=pipe_txt2vid.scheduler,
    feature_extractor=pipe_txt2vid.feature_extractor,
    image_encoder=pipe_txt2vid.image_encoder,
)

pipe_vid2vid.enable_free_noise(context_length=16, context_stride=4)
pipe_vid2vid.to(device, dtype=dtype)

prompt = "A strong man standing in rain, cyberpunk aesthetic, futuristic, bright background"
negative_prompt = "low quality, worst quality, jpeg artifacts"
width = 512
height = 640
num_frames = 32
ip_adapter_image1 = load_image("inputs/man1.png")
ip_adapter_image2 = load_image("inputs/man2.png")

# Run inference to get latents to be upscaled
latents = pipe_txt2vid(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=width,
    height=height,
    ip_adapter_image=[[ip_adapter_image1, ip_adapter_image2]],
    num_frames=num_frames,
    guidance_scale=2.5,
    num_inference_steps=10,
    generator=torch.Generator("cpu").manual_seed(0),
    output_type="latent",
).frames

# Run latent upscaling
# Note that only naive upscaling is done here. Alternatively, a latent upscaler
# model could be used
batch_size, num_channels, num_frames, latent_height, latent_width = latents.shape
scale_factor = 1.5
scale_method = "nearest-exact"
upscaled_height = int(height * scale_factor)
upscaled_width = int(width * scale_factor)
upscaled_latent_height = int(latent_height * scale_factor)
upscaled_latent_width = int(latent_width * scale_factor)
strength = 0.6

upscaled_latents = []
for i in range(batch_size):
    latent = F.interpolate(latents[i], size=(upscaled_latent_height, upscaled_latent_width), mode=scale_method)
    upscaled_latents.append(latent.unsqueeze(0))
upscaled_latents = torch.cat(upscaled_latents, dim=0)

# Run pipeline for denoising upscaled latents
output = pipe_vid2vid(
    prompt=prompt,
    negative_prompt=negative_prompt,
    ip_adapter_image=[[ip_adapter_image1, ip_adapter_image2]],
    width=upscaled_width,
    height=upscaled_height,
    guidance_scale=2.5,
    num_inference_steps=10,
    enforce_inference_steps=True,
    generator=torch.Generator("cpu").manual_seed(0),
    output_type="pil",
    latents=upscaled_latents,
    strength=strength,
)

# Save video
frames = output.frames[0]
export_to_video(frames, "output.mp4", fps=8)

Video-to-Video Prompt Travel + ControlNet

TODO: ControlNet has not been optimized for batched inference yet. This will be updated soon.

Code

Video-to-Video Prompt Travel + ControlNet + Latent Upscale

TODO: ControlNet has not been optimized for batched inference yet. This will be updated soon.

Code

Frame interpolation

TODO: SparseCtrl is not supported or optimized for batched inference yet. This will be updated soon.

Code

Memory optimizations

Nothing too fancy here. To lower memory usage, chunking is performed across the spatial batch in motion blocks; across the temporal batch in transformer blocks, resnet, upsampling, and downsampling blocks; and across the spatial/temporal batch in attention feed-forward chunking. This is mostly possible because the normalization layers are either LayerNorm or GroupNorm, which play well with chunked inference across batch dimensions. To enable the memory optimizations, the following are required (at the time of making the PR - will be updated after reviews); a minimal sketch of the chunking idea follows the list:

  • pipe.unet.enable_attn_chunking: Chunking across temporal batches when passing through spatial attention blocks
  • pipe.unet.enable_motion_module_chunking: Chunking across spatial batches when passing through temporal attention blocks
  • pipe.unet.enable_resnet_chunking: Chunking across resnet layers
  • pipe.unet.enable_forward_chunking: Chunking across attention FeedForward layers
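The sketch below shows the basic batch-dimension chunking pattern these flags rely on (not the actual diffusers implementation; chunked_forward is a hypothetical helper): a module is applied to smaller slices of the batch/temporal dimension and the results are concatenated, trading one large activation for several smaller ones.

import torch
import torch.nn as nn

def chunked_forward(module: nn.Module, hidden_states: torch.Tensor, chunk_size: int, dim: int = 0) -> torch.Tensor:
    # Apply `module` to slices of `hidden_states` along `dim` and concatenate the results,
    # so peak activation memory scales with `chunk_size` rather than the full batch.
    outputs = [module(chunk) for chunk in hidden_states.split(chunk_size, dim=dim)]
    return torch.cat(outputs, dim=dim)

# Example: a feed-forward block applied over a (batch_size * num_frames, seq_len, channels) tensor
ff = nn.Sequential(nn.Linear(320, 1280), nn.GELU(), nn.Linear(1280, 320))
hidden_states = torch.randn(2 * 128, 64, 320)  # 2 videos x 128 frames, flattened
out = chunked_forward(ff, hidden_states, chunk_size=16, dim=0)
assert out.shape == hidden_states.shape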
(Memory trace comparison: before vs. after optimizations - images omitted)

The main pain points, as observed from the memory trace spikes, were the attention layers, resnet layers, upsampling/downsampling layers, and a call to torch.where. After the improvements, the memory spikes are flattened out, but there is still room for improvement with offloading and better batching across the intermediate layers that are not handled perfectly yet. The end goal is to run techniques like FreeNoise in a manner such that the total memory depends only on the context length and scales linearly, or remains constant, with the number of frames.
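For the torch.where spike specifically, one way to flatten it is to normalize the accumulated FreeNoise values one context-sized slice of frames at a time instead of materializing the operation over the full frame dimension at once. A rough sketch of that pattern (illustrative only, not the exact PR code):

import torch

def chunked_normalize(accumulated_values, num_times_accumulated, chunk_size=16):
    # accumulated_values, num_times_accumulated: (batch, num_frames, seq_len, channels)-like tensors
    chunks = []
    for start in range(0, accumulated_values.size(1), chunk_size):
        values = accumulated_values[:, start : start + chunk_size]
        counts = num_times_accumulated[:, start : start + chunk_size]
        # Each torch.where now only touches a chunk_size slice of frames
        chunks.append(torch.where(counts > 0, values / counts, values))
    return torch.cat(chunks, dim=1)

# Tiny usage example with dummy tensors
acc = torch.randn(1, 128, 64, 320)
counts = torch.randint(0, 3, (1, 128, 1, 1)).float()
out = chunked_normalize(acc, counts)
assert out.shape == acc.shape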

Adding my memory-optimization logs for some of the things that worked, for anyone who's interested in the story of going from 25 GB for 128 frames down to 12 GB by making inference-only changes to the UNet layers. Many other things I tried were too silly to make the list and didn't work, so those have been omitted. It goes without saying that these can be combined with other memory-optimization techniques to further reduce overall usage.

log
# With no optimizations (128 frames)
# memory=3.860
# max_memory=25.164
# max_reserved=31.049


# With only FF chunking (128 frames)
# memory=3.866
# max_memory=13.927
# max_reserved=29.797


# With both FF and Resnet chunking (128 frames)
# memory=3.864
# max_memory=13.317
# max_reserved=25.703


# With both FF and Resnet chunking, experiment: inplace op and mask instead of torch.where :( (128 frames)
# memory=3.864
# max_memory=18.841
# max_reserved=27.469


# With both FF and Resnet chunking, experiment attention to_out chunking (128 frames)
# memory=3.866
# max_memory=12.905
# max_reserved=25.158
# After testing above, I added chunking to attention modules across batch dimension which produced
# same numerical results with slightly lower memory usage


# After all of the above, upsampling had the highest memory usage. But, there were also memory spikes (although
# not to the level of upsampling) in torch.where in FreeNoiseTransformerBlock. (early-exit before up blocks
# to measure experiment correctly) (320 frames)
#
# 1. torch.where
# memory=12.013
# max_memory=21.427
# max_reserved=24.348
#
# 2. Chunked implementation of torch.where
# memory=12.010
# max_memory=18.299
# max_reserved=21.986
#
# yeeeeeeeeeeeeee! (NOTE: These are before entering up blocks which are now the bottleneck to beat)
# Choosing 2. and generating 320 frames: (time: 2 mins 51 seconds)
# memory=3.860
# max_memory=25.518
# max_reserved=39.229


# ----- test run to verify outputs (256 frames) -----
# Before optimizations:
# memory=3.859
# max_memory=46.469
# max_reserved=58.127
#
# After optimizations:
# memory=3.864
# max_memory=21.193
# max_reserved=32.260


# With FF and Resnet upsample/downsample chunking (320 frames)
# memory=3.866
# max_memory=22.988
# max_reserved=26.680


# Chunking across 3D transformer (320 frames)
# memory=3.871
# max_memory=19.632
# max_reserved=27.088

Using LoRAs

Since enabling FreeNoise replaces BasicTransformerBlock with FreeNoiseTransformerBlock, any LoRAs loaded into the attention QKV or output projection layers will not be usable, because the LoRA config information is not readily available and carrying it over would require some tedious implementation. The easiest way to avoid LoRA-related state-dict loading failures is to enable FreeNoise BEFORE loading any LoRAs.

# Initialize pipe
# ...

pipe.enable_free_noise(context_length=16, context_stride=4)

# ...
# Load motion or stylistic or any other loras

Additionally, you could load, fuse, and then unload the LoRAs, as sketched below. With this approach, the ordering of the enable_free_noise call would not matter.
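A rough sketch of the load-fuse-unload variant, assuming the standard diffusers LoRA helpers (load_lora_weights, fuse_lora, unload_lora_weights) and the same AnimateLCM LoRA used in the examples above:

# Fuse the LoRA weights into the base layers, then drop the separate LoRA layers.
# Once fused, enabling FreeNoise before or after makes no difference to the LoRA.
pipe.load_lora_weights(
    "wangfuyun/AnimateLCM", weight_name="AnimateLCM_sd15_t2v_lora.safetensors", adapter_name="lcm_lora"
)
pipe.fuse_lora(lora_scale=0.8)
pipe.unload_lora_weights()

pipe.enable_free_noise(context_length=16, context_stride=4)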

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@DN6

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

aycax commented Aug 22, 2024

Hi! I tried to recreate the video with the girl and the seasons, but I got an error like this:
AttributeError: 'UNetMotionModel' object has no attribute 'enable_attn_chunking'

here is my notebook

a-r-r-o-w (Member, Author) commented Aug 22, 2024

@aycax Thanks for testing! I recently made some updates to the chunked inference code design. For the current version of the PR, you will now have to do:

context_length = 16
context_stride = 4
pipe.enable_free_noise(context_length=context_length, context_stride=context_stride)
pipe.enable_free_noise_chunked_inference()
pipe.unet.enable_forward_chunking(context_length)

Or, you could install the version of this PR from before this commit. Since this PR is a work in progress, some things might change unexpectedly, but I'll be sure to update all the example code to reflect correct usage when it's ready.

@a-r-r-o-w a-r-r-o-w requested a review from DN6 August 22, 2024 19:34
@a-r-r-o-w a-r-r-o-w marked this pull request as ready for review August 24, 2024 00:09
@a-r-r-o-w a-r-r-o-w changed the title from "AnimateDiff prompt travel and memory optimizations" to "AnimateDiff prompt travel" on Aug 24, 2024
@@ -1087,8 +1104,15 @@ def forward(
accumulated_values[:, frame_start:frame_end] += hidden_states_chunk * weights
num_times_accumulated[:, frame_start:frame_end] += weights

-        hidden_states = torch.where(
-            num_times_accumulated > 0, accumulated_values / num_times_accumulated, accumulated_values
+        hidden_states = torch.cat(
Collaborator

Better to include this change in the memory optimisations, no?

Member Author

Sounds good! Will revert here

@@ -69,6 +70,9 @@ def _enable_free_noise_in_block(self, block: Union[CrossAttnDownBlockMotion, Dow
motion_module.transformer_blocks[i].load_state_dict(
basic_transfomer_block.state_dict(), strict=True
)
+                motion_module.transformer_blocks[i].set_chunk_feed_forward(
Collaborator

Also probably better to include in the memory optimisations PR?

@DN6 DN6 merged commit cbc2ec8 into main Aug 28, 2024
18 checks passed
@a-r-r-o-w a-r-r-o-w deleted the animatediff/freenoise-improvements branch August 28, 2024 09:21