[core] Freenoise memory improvements #9262

Merged
a-r-r-o-w merged 24 commits into main from animatediff/freenoise-memory-improvements on Sep 6, 2024

Conversation

a-r-r-o-w (Member):

What does this PR do?

Memory improvements from #9231. The previous PR had too many changes, so it has been split up to make it easier to review. This PR contains just the FreeNoise memory improvements, whereas the previous one contains prompt-travel support. This PR will be ready for review after #9231 has been merged and this branch has been rebased.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@DN6

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@a-r-r-o-w a-r-r-o-w marked this pull request as ready for review September 5, 2024 05:52
@sayakpaul (Member) left a comment:

I was not requested for a review, but I did one anyway to learn some things. Thanks, Aryan!

) -> None:
    for i in range(len(attentions)):
        attentions[i] = SplitInferenceModule(
            attentions[i], temporal_split_size, 0, ["hidden_states", "encoder_hidden_states"]
Member:

Should ["hidden_states", "encoder_hidden_states"] not be configurable or not really?

Member Author:

Sorry, I don't understand the comment very well.

SplitInferenceModule sets input_kwargs_to_split to ["hidden_states"] by default if no parameter is passed. I want both hidden_states and encoder_hidden_states to be split based on split_size here.
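
For reference, a minimal sketch of the SplitInferenceModule idea described here, with the signature inferred from this PR's diff (an assumption, not necessarily the exact diffusers implementation):

```python
import torch
import torch.nn as nn

class SplitInferenceModule(nn.Module):
    # Wraps a module and runs it on split_size-sized chunks of the listed
    # tensor kwargs along split_dim, concatenating the results afterwards.
    def __init__(self, module, split_size=1, split_dim=0,
                 input_kwargs_to_split=("hidden_states",)):
        super().__init__()
        self.module = module
        self.split_size = split_size
        self.split_dim = split_dim
        self.input_kwargs_to_split = set(input_kwargs_to_split)

    def forward(self, *args, **kwargs):
        # Split only the requested tensor kwargs; everything else passes through.
        split_inputs = {
            k: v.split(self.split_size, dim=self.split_dim)
            for k, v in kwargs.items()
            if k in self.input_kwargs_to_split and torch.is_tensor(v)
        }
        kwargs = {k: v for k, v in kwargs.items() if k not in split_inputs}
        results = []
        for split_input in zip(*split_inputs.values()):
            inputs = dict(zip(split_inputs.keys(), split_input))
            inputs.update(kwargs)
            results.append(self.module(*args, **inputs))
        # Assumes the wrapped module preserves the split dimension.
        return torch.cat(results, dim=self.split_dim)
```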

Member:

Sorry, okay. I am assuming that is a reasonable default to choose? I was wondering if it could make sense to let users choose the inputs they want to split.

Member Author:

I would say let's keep it in mind to allow users more control over this, but for now let's keep the scope of changes minimal. I would like to experiment with FreeNoise for CogVideoX as discussed internally, and so would like to get this in soon :)

Member:

Okay, then we could add a comment before the blocks that could be configured and revisit those if needed?

Member Author:

Thanks, updated. WDYT?

Comment on lines +507 to +517
if getattr(block, "motion_modules", None) is not None:
self._enable_split_inference_motion_modules_(block.motion_modules, spatial_split_size)
if getattr(block, "attentions", None) is not None:
self._enable_split_inference_attentions_(block.attentions, temporal_split_size)
if getattr(block, "resnets", None) is not None:
self._enable_split_inference_resnets_(block.resnets, temporal_split_size)
if getattr(block, "downsamplers", None) is not None:
self._enable_split_inference_samplers_(block.downsamplers, temporal_split_size)
if getattr(block, "upsamplers", None) is not None:
self._enable_split_inference_samplers_(block.upsamplers, temporal_split_size)

Member:

Same question. Should attentions, resnets, etc. be configurable or not?

Member Author:

Not sure I understand this comment either. Basically, we're going to be splitting across the batch dimension for these layers, based on the chosen spatial_split_size and temporal_split_size values.
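
For illustration, a hedged usage sketch; the checkpoints and split sizes are placeholders, and enable_free_noise_split_inference is assumed to be the public entry point this PR adds:

```python
import torch
from diffusers import AnimateDiffPipeline, MotionAdapter

# Illustrative checkpoints; any AnimateDiff-compatible pair should work.
adapter = MotionAdapter.from_pretrained(
    "guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16
)
pipe = AnimateDiffPipeline.from_pretrained(
    "emilianJR/epiCRealism", motion_adapter=adapter, torch_dtype=torch.float16
).to("cuda")

pipe.enable_free_noise(context_length=16, context_stride=4)
# Smaller split sizes lower peak memory at the cost of speed (values assumed).
pipe.enable_free_noise_split_inference(
    spatial_split_size=256, temporal_split_size=16
)
```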

Member:

Similarly.

> Basically, we're going to be splitting across the batch dimension for these layers, based on the chosen spatial_split_size and temporal_split_size values.

Could it make sense to let users choose the kinds of layers to apply splitting to? Perhaps we default to all of them (attentions, motion_modules, resnets, downsamplers, upsamplers), or not really?

src/diffusers/pipelines/free_noise_utils.py (outdated; resolved)
Comment on lines +1107 to +1115
hidden_states = torch.cat(
    [
        torch.where(num_times_split > 0, accumulated_split / num_times_split, accumulated_split)
        for accumulated_split, num_times_split in zip(
            accumulated_values.split(self.context_length, dim=1),
            num_times_accumulated.split(self.context_length, dim=1),
        )
    ],
    dim=1,
Member:

So, this seems to be a form of chunking?

Member Author:

Yep. At some point while lowering the peaks in the memory traces, torch.where became the bottleneck. This was actually first noticed by @DN6, so credits to him.

@sayakpaul (Member), Sep 5, 2024:

Hmm, do we know the situations in which torch.where() leads to spikes? It seems a little weird to me, honestly, because native conditionals like torch.where() are supposed to be more efficient.

Member Author:

I think the spike we see is due to tensors being copied. The intermediate dimensions for attention get large when generating many frames (say, 200+). We could do something different here too - I just did what seemed like the easiest thing, as these changes were made while I was trying different things in quick succession to golf down the memory spikes.
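
A toy illustration of why the split helps, with hypothetical shapes: the one-shot torch.where holds a full-size temporary for the division alongside the full-size output, while the split variant in the snippet above only ever holds context-length-wide temporaries before the final concatenation:

```python
import torch

context_length = 16
accumulated_values = torch.randn(2, 256, 64, 1280)            # many frames
num_times_accumulated = torch.randint(1, 5, (2, 256, 1, 1)).float()

# One-shot: the division materializes a tensor the size of
# accumulated_values, held alive while torch.where builds the output.
full = torch.where(
    num_times_accumulated > 0,
    accumulated_values / num_times_accumulated,
    accumulated_values,
)

# Split: each division/select temporary is only context_length frames wide.
split = torch.cat(
    [
        torch.where(n > 0, a / n, a)
        for a, n in zip(
            accumulated_values.split(context_length, dim=1),
            num_times_accumulated.split(context_length, dim=1),
        )
    ],
    dim=1,
)
assert torch.allclose(full, split)
```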

Member:

Ah, cool. Let's perhaps make a note of this to revisit later? At least this way, we are aware.

Member Author:

Alright, made a note. LMK if any further changes are needed on this :)

@a-r-r-o-w (Member Author):

@sayakpaul WDYT about the explanation of SplitInferenceModule now?

@a-r-r-o-w a-r-r-o-w requested a review from DN6 September 5, 2024 10:04
@DN6 (Collaborator) left a comment:

PR looks really good 👍🏽 Could you just add a fast GPU test to verify that the split-inference outputs and the normal outputs are the same?
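
A hedged sketch of what such an equivalence test could look like (the helper name, tolerance, and inputs are assumptions; pipe_inputs should carry a fixed generator so both runs are deterministic):

```python
import torch

@torch.no_grad()
def test_free_noise_split_inference_matches_baseline(pipe, pipe_inputs):
    # Baseline: FreeNoise without split inference.
    pipe.enable_free_noise()
    baseline = pipe(**pipe_inputs, output_type="pt").frames

    # Same inputs with split inference enabled; outputs should match.
    pipe.enable_free_noise_split_inference(
        spatial_split_size=256, temporal_split_size=16
    )
    split = pipe(**pipe_inputs, output_type="pt").frames

    assert torch.allclose(baseline, split, atol=1e-4)
```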

@@ -70,6 +168,9 @@ def _enable_free_noise_in_block(self, block: Union[CrossAttnDownBlockMotion, Dow
motion_module.transformer_blocks[i].load_state_dict(
    basic_transfomer_block.state_dict(), strict=True
)
motion_module.transformer_blocks[i].set_chunk_feed_forward(
Collaborator:

Do we always need chunked feed-forward set when enabling FreeNoise? Might be overkill, no?

Member Author:

This is only there to carry forward the chunked feed-forward behaviour if and only if it was already enabled in the BasicTransformerBlock. Basically, if it was not enabled in the BTB, motion_module.transformer_blocks[i]._chunk_size would be None, leading to the default behaviour of no chunking. If it was enabled in the BTB, it carries forward by default to the FreeNoiseTransformerBlock.
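
A minimal sketch of the chunked feed-forward dispatch being carried forward here, assuming helper names along the lines of diffusers' internals (not necessarily the exact implementation):

```python
import torch

def _chunked_feed_forward(ff, hidden_states, chunk_dim, chunk_size):
    # chunk_size is None -> the default behaviour: one full-width call.
    if chunk_size is None:
        return ff(hidden_states)
    # Otherwise run the FF on chunk_size-sized slices and re-concatenate.
    return torch.cat(
        [ff(chunk) for chunk in hidden_states.split(chunk_size, dim=chunk_dim)],
        dim=chunk_dim,
    )
```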

    input_kwargs_to_split: List[str] = ["hidden_states"],
) -> None:
    super().__init__()

Collaborator:

I would also add a docstring here to explain the init arguments. Maybe the workflow example in forward can be moved up here too?

Member Author:

Sounds good!

Comment on lines +113 to +115
for split_input in zip(*split_inputs.values()):
    inputs = dict(zip(split_inputs.keys(), split_input))
    inputs.update(kwargs)
Collaborator:

Clean 😎

Member Author:

Not too proud of this one 😬 but hey, I'm not afraid of not understanding what sorcery is going on here two weeks later.
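
For whoever decodes this later, a toy run of the same pattern (values hypothetical):

```python
import torch

split_inputs = {
    "hidden_states": torch.arange(6).reshape(3, 2).split(1, dim=0),
    "encoder_hidden_states": torch.arange(6, 12).reshape(3, 2).split(1, dim=0),
}
kwargs = {"attention_mask": None}  # non-split kwargs pass through untouched

# zip(*values) yields the i-th chunk of every key at step i; zipping back
# with the keys rebuilds a per-chunk kwargs dict for the wrapped module.
for split_input in zip(*split_inputs.values()):
    inputs = dict(zip(split_inputs.keys(), split_input))
    inputs.update(kwargs)
    print({k: (v.shape if torch.is_tensor(v) else v) for k, v in inputs.items()})
```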


@a-r-r-o-w a-r-r-o-w merged commit 6dfa499 into main Sep 6, 2024
18 checks passed
@a-r-r-o-w a-r-r-o-w deleted the animatediff/freenoise-memory-improvements branch September 6, 2024 07:30
sayakpaul pushed a commit that referenced this pull request Sep 6, 2024

[core] Freenoise memory improvements (#9262)

* update

* implement prompt interpolation

* make style

* resnet memory optimizations

* more memory optimizations; todo: refactor

* update

* update animatediff controlnet with latest changes

* refactor chunked inference changes

* remove print statements

* update

* chunk -> split

* remove changes from incorrect conflict resolution

* remove changes from incorrect conflict resolution

* add explanation of SplitInferenceModule

* update docs

* Revert "update docs"

This reverts commit c55a50a.

* update docstring for freenoise split inference

* apply suggestions from review

* add tests

* apply suggestions from review

sayakpaul added a commit that referenced this pull request Oct 21, 2024