
[Community Pipeline] Add 🪆Matryoshka Diffusion Models #9157

Merged
merged 137 commits into huggingface:main from tolgacangoz:Add-Matryoshka-Diffusion-Models on Oct 14, 2024

Conversation

tolgacangoz
Contributor

@tolgacangoz tolgacangoz commented Aug 12, 2024

Thanks for the opportunity to work on this model!

The Abstract of the paper:

Diffusion models are the de-facto approach for generating high-quality images and videos but learning high-dimensional models remains a formidable task due to computational and optimization challenges. Existing methods often resort to training cascaded models in pixel space, or using a downsampled latent space of a separately trained auto-encoder. In this paper, we introduce Matryoshka Diffusion (MDM), a novel framework for high-resolution image and video synthesis. We propose a diffusion process that denoises inputs at multiple resolutions jointly and uses a NestedUNet architecture where features and parameters for small scale inputs are nested within those of the large scales. In addition, MDM enables a progressive training schedule from lower to higher resolutions which leads to significant improvements in optimization for high-resolution generation. We demonstrate the effectiveness of our approach on various benchmarks, including class-conditioned image generation, high-resolution text-to-image, and text-to-video applications. Remarkably, we can train a single pixel-space model at resolutions of up to 1024 × 1024 pixels, demonstrating strong zero shot generalization using the CC12M dataset, which contains only 12 million images. Code and pre-trained checkpoints are released at https://github.com/apple/ml-mdm.

Paper: 🪆Matryoshka Diffusion Models
Repository: https://github.com/apple/ml-mdm
Hugging Face Space: https://huggingface.co/spaces/pcuenq/mdm
License: MIT


Key takeaways from the paper:

  • VAE: none, since Matryoshka Diffusion Models operate directly in the (extended) pixel space(s).
  • Text encoder: flan-t5-xl
  • Enables:
    1. a multi-resolution loss that greatly improves the convergence speed of high-resolution input denoising (see the sketch after this list).
    2. an efficient progressive training schedule that starts by training a low-resolution diffusion model and gradually adds higher-resolution inputs and outputs according to a schedule, speeding up overall convergence.
  • MDM allows training high-resolution models without resorting to cascaded models (since each sub-model is trained separately, generation quality can be bottlenecked by exposure bias (Bengio et al., 2015) from imperfect predictions, and a separate model must be trained for each resolution), latent diffusion (where the lossy compression process not only increases the complexity of learning but also bounds generation quality), or other end-to-end models (which, without fully exploiting the innate structure of hierarchical generation, lag behind cascaded and latent models).
  • Resolution-specific noise schedules are used.
  • More computation is allocated to the low-resolution feature maps.
  • MDM has extensive parameter sharing across resolutions.
  • The authors observe that going from two resolution levels to three consistently improves convergence, while increasing the number of nesting levels brings only negligible extra cost.
  • LDM and MDM are complementary: it is possible to build MDM on top of autoencoder latents.
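To make the multi-resolution loss concrete, here is a minimal sketch of the idea: the same image is denoised jointly at every nested resolution with a v-prediction target, and the per-level losses are averaged. The model signature and all names here are my own illustrative assumptions, not the PR's actual implementation.

import torch
import torch.nn.functional as F

def multi_resolution_vloss(model, x0, t, alphas_cumprod, text_emb, resolutions=(64, 256)):
    # Build the "Matryoshka" input: the same image at each nested resolution,
    # add noise per level, and collect v-prediction targets.
    noisy, targets = [], []
    for r in resolutions:
        x0_r = F.interpolate(x0, size=(r, r), mode="area")
        eps = torch.randn_like(x0_r)
        a = alphas_cumprod[t].view(-1, 1, 1, 1)
        noisy.append(a.sqrt() * x0_r + (1 - a).sqrt() * eps)    # forward diffusion
        targets.append(a.sqrt() * eps - (1 - a).sqrt() * x0_r)  # v-prediction target
    preds = model(noisy, t, text_emb)  # assumed: one prediction per nesting level
    losses = [F.mse_loss(p, v) for p, v in zip(preds, targets)]
    return sum(losses) / len(losses)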

TODOs:
✅ The U-Net, i.e., the innermost structure (nesting_level=0), would approximately be as follows:

UNet2DConditionModel(in_channels=3, out_channels=3, block_out_channels=(256, 512, 768),
		cross_attention_dim=2048, resnet_time_scale_shift='scale_shift',
		down_block_types=('DownBlock2D', 'CrossAttnDownBlock2D', 'CrossAttnDownBlock2D'),
		up_block_types=('CrossAttnUpBlock2D', 'CrossAttnUpBlock2D', 'UpBlock2D'),
		ff_act_fn='gelu', transformer_layers_per_block=[0, 1, 5],
		use_linear_projection='no_projection', attention_bias=True,
		norm_type='layer_norm_matryoshka', ff_norm_type='group_norm_matryoshka',
		cross_attention_norm='layer_norm', attention_pre_only=True,
		encoder_hid_dim_type='text_proj', encoder_hid_dim=2048,
		flip_sin_to_cos=False, masked_cross_attention=False,
		micro_conditioning_scale=64, addition_embed_type='matryoshka')
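
For intuition only: the NestedUNet idea can be sketched as an outer UNet whose low-resolution path is an entire inner UNet, so features and parameters for small-scale inputs are nested within those of the large scales. This toy sketch is my own simplification under assumed names, not the PR's actual class:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyInnerUNet(nn.Module):
    # stand-in for the innermost (nesting_level=0) model
    def __init__(self, ch=3):
        super().__init__()
        self.body = nn.Conv2d(ch, ch, 3, padding=1)
    def forward(self, x, t):
        return self.body(x)

class NestedUNet(nn.Module):
    # Outer level: encodes the high-res input, hands the low-res input
    # (plus downsampled high-res features) to the inner UNet, and injects
    # the inner prediction back into its own decoder path.
    def __init__(self, inner, ch=3):
        super().__init__()
        self.inner = inner
        self.enc = nn.Conv2d(ch, ch, 3, padding=1)
        self.dec = nn.Conv2d(ch, ch, 3, padding=1)
    def forward(self, x_hi, x_lo, t):
        h = self.enc(x_hi)
        lo_in = x_lo + F.interpolate(h, size=x_lo.shape[-2:], mode="area")
        lo_out = self.inner(lo_in, t)
        h = h + F.interpolate(lo_out, size=h.shape[-2:], mode="nearest")
        return self.dec(h), lo_out  # one output per resolution

net = NestedUNet(TinyInnerUNet())
out_hi, out_lo = net(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 64, 64), t=None)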

✅ Scheduler:

  • It calculates timesteps and uses prev_timestep in a slightly different way: the UNet receives timestep t−1 while the scheduler step uses t, and the last timestep is not used.
  • A nesting_level=1 model uses 2 noise latents (3×64×64 and 3×256×256), and a nesting_level=2 model uses 3 (3×64×64, 3×256×256, 3×1024×1024). Each noise latent has its own calculations in the scheduler, so a nesting_level=2 model produces 3 images at 3 different resolutions.
  • Some optimizations might be possible; e.g., the scheduler currently makes its calculations sequentially for each noise latent, and since the latents have different shapes, broadcasting cannot be used directly. IMHO, one could pad them to equal shapes, compute with broadcasting, and mask at the end, at the expense of more memory usage (see the sketch below the scheduler snippet).
scheduler = MatryoshkaDDIMScheduler(prediction_type="v_prediction",
		beta_schedule="squaredcos_cap_v2", timestep_spacing="matryoshka_style",)
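
A hedged sketch of the optimization point above: a per-latent DDIM-style update done in a Python loop versus a padded, broadcasted alternative. ddim_v_step is a hypothetical stand-in for the real scheduler math, not the PR's implementation.

import torch
import torch.nn.functional as F

def ddim_v_step(x_t, v_pred, alpha_t, alpha_prev):
    # hypothetical deterministic DDIM update under v-prediction
    x0 = alpha_t.sqrt() * x_t - (1 - alpha_t).sqrt() * v_pred
    eps = (1 - alpha_t).sqrt() * x_t + alpha_t.sqrt() * v_pred
    return alpha_prev.sqrt() * x0 + (1 - alpha_prev).sqrt() * eps

alpha_t, alpha_prev = torch.tensor(0.9), torch.tensor(0.95)
latents = [torch.randn(1, 3, 64, 64), torch.randn(1, 3, 256, 256)]
preds = [torch.randn_like(x) for x in latents]

# Current approach: step each differently-shaped latent sequentially.
stepped = [ddim_v_step(x, v, alpha_t, alpha_prev) for x, v in zip(latents, preds)]

# Suggested alternative: zero-pad to a common shape, stack, do one broadcasted
# update, then crop the padding away (trades extra memory for fewer kernel calls).
H = max(x.shape[-1] for x in latents)
pad = lambda x: F.pad(x, (0, H - x.shape[-1], 0, H - x.shape[-2]))
batch_x = torch.cat([pad(x) for x in latents])
batch_v = torch.cat([pad(v) for v in preds])
batch_out = ddim_v_step(batch_x, batch_v, alpha_t, alpha_prev)
stepped_alt = [batch_out[i : i + 1, :, : x.shape[-2], : x.shape[-1]] for i, x in enumerate(latents)]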

✅ convert_matryoshka_model_to_diffusers.py
✅ Show example results:
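
(For context, loading the community pipeline looks roughly like the sketch below; the checkpoint repo id, the custom_pipeline name, and the nesting_level argument reflect this PR's HF integration, but treat the exact identifiers as assumptions.)

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "tolgacangoz/matryoshka-diffusion-models",  # assumed converted-checkpoint repo id
    custom_pipeline="matryoshka",               # assumed community pipeline name
    nesting_level=0,                            # 0 → 64×64, 1 → 256×256, 2 → 1024×1024
    torch_dtype=torch.float16,
).to("cuda")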

from diffusers.utils import make_image_grid

prompt0 = "a blue jay stops on the top of a helmet of Japanese samurai, background with sakura tree"
prompt = f"breathtaking {prompt0}. award-winning, professional, highly detailed"
image = pipe(prompt=prompt, num_inference_steps=50).images  # 50/150/250 steps for nesting_level 0/1/2
make_image_grid(image, rows=1, cols=len(image))
  • 64×64, nesting_level=0: 1.719 GiB. With 50 DDIM inference steps:
    [image: bird_64_64 (64×64)]
  • 256×256, nesting_level=1: 1.776 GiB. With 150 DDIM inference steps:
    [images: bird_256_64 (64×64), bird_256_256 (256×256)]
  • 1024×1024, nesting_level=2: 1.792 GiB. As one can see, the memory cost of adding another nesting level is truly negligible in this context! With 250 DDIM inference steps:
    [images: bird_1024_64 (64×64), bird_1024_256 (256×256), bird_1024_1024 (1024×1024)]

✅ Finish HF integration & upload converted checkpoints to HF.
✅ README.md
⏳ Make it as simple as possible, but not simpler. Note: I may make small additions/modifications in the future, e.g., to comments, etc.
⏳ examples/**/train_matryoshka.py

[Open In Colab badge]

I would like to congratulate you on this great work and to thank you for open-sourcing the codebase under the MIT license, @MultiPath, @Shuangfei, @dreasysnail, Josh Susskind, @ndjaitly, @luke-carlson!

I anticipate that this kind of representation learning will become popular, that acceleration improvements from contemporary diffusion modeling will be adapted to this model, and that training will be democratized in the future, without the need for large resources.

@sayakpaul @pcuenca @a-r-r-o-w

@tolgacangoz tolgacangoz changed the title Add Matryoshka Diffusion Models Add 🪆Matryoshka Diffusion Models Aug 12, 2024
@sayakpaul
Member

@tolgacangoz would you have cycles to work on this soon? Another contributor has expressed interest in working on it. Maybe you two could collaborate?

@tolgacangoz
Contributor Author

I am working on the inference code at the moment. Will the training code in examples/**/train_matryoshka.py be implemented as well (since this model is very efficient to train)? If so, he can take that up.

@sayakpaul
Member

For now, we don't have to focus on training.

@tolgacangoz tolgacangoz changed the title Add 🪆Matryoshka Diffusion Models [Community Pipeline] Add 🪆Matryoshka Diffusion Models Sep 7, 2024
@tolgacangoz tolgacangoz marked this pull request as ready for review October 13, 2024 10:43
@tolgacangoz tolgacangoz marked this pull request as draft October 13, 2024 16:26
@tolgacangoz tolgacangoz marked this pull request as ready for review October 13, 2024 17:03
@tolgacangoz tolgacangoz marked this pull request as draft October 13, 2024 19:02
@luke-carlson

Thank you for working on this @tolgacangoz!

@tolgacangoz tolgacangoz marked this pull request as ready for review October 14, 2024 08:37
Collaborator

@yiyixuxu yiyixuxu left a comment


thanks!

@yiyixuxu yiyixuxu merged commit 56c2115 into huggingface:main Oct 14, 2024
8 checks passed
@tolgacangoz
Contributor Author

Thanks for merging!

@tolgacangoz tolgacangoz deleted the Add-Matryoshka-Diffusion-Models branch October 15, 2024 06:59
@tolgacangoz tolgacangoz restored the Add-Matryoshka-Diffusion-Models branch October 15, 2024 07:53
@tolgacangoz tolgacangoz deleted the Add-Matryoshka-Diffusion-Models branch October 15, 2024 07:54
@luke-carlson

Hey @tolgacangoz, are there any changes we need to make here to incorporate Jiatao's latest changes? (apple/ml-mdm#21)

@tolgacangoz
Contributor Author

tolgacangoz commented Oct 15, 2024

Probably. I will look into it tomorrow.

Edit: The usage of schedule_shifted_power seems to have changed. I will make the necessary changes.
