[core] Freenoise memory improvements #9262

Merged
a-r-r-o-w merged 24 commits into main from animatediff/freenoise-memory-improvements on Sep 6, 2024

Conversation

a-r-r-o-w (Member):

What does this PR do?

Memory improvements from #9231. The previous PR had too many changes, so it has been split up to make it easier to review. This PR contains just the FreeNoise memory improvements, whereas the previous one contains prompt-travel support. This PR will be ready for review after #9231 has been merged and this branch has been rebased.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@DN6

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@a-r-r-o-w a-r-r-o-w marked this pull request as ready for review September 5, 2024 05:52
@sayakpaul (Member) left a comment:

I was not requested for a review, but I did one anyway to learn some things. Thanks, Aryan!

) -> None:
    for i in range(len(attentions)):
        attentions[i] = SplitInferenceModule(
            attentions[i], temporal_split_size, 0, ["hidden_states", "encoder_hidden_states"]
Member:

Should ["hidden_states", "encoder_hidden_states"] not be configurable or not really?

Member Author:

Sorry, I don't understand the comment very well.

SplitInferenceModule sets input_kwargs_to_split to ["hidden_states"] by default if no parameter is passed. I want both hidden_states and encoder_hidden_states to be split based on split_size here.
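
For reference, a minimal sketch of the SplitInferenceModule idea described here, with the signature inferred from this PR's diff (an assumption, not necessarily the exact diffusers implementation):

```python
import torch
import torch.nn as nn

class SplitInferenceModule(nn.Module):
    # Wraps a module and runs it on split_size-sized chunks of the listed
    # tensor kwargs along split_dim, concatenating the results afterwards.
    def __init__(self, module, split_size=1, split_dim=0,
                 input_kwargs_to_split=("hidden_states",)):
        super().__init__()
        self.module = module
        self.split_size = split_size
        self.split_dim = split_dim
        self.input_kwargs_to_split = set(input_kwargs_to_split)

    def forward(self, *args, **kwargs):
        # Split only the requested tensor kwargs; everything else passes through.
        split_inputs = {
            k: v.split(self.split_size, dim=self.split_dim)
            for k, v in kwargs.items()
            if k in self.input_kwargs_to_split and torch.is_tensor(v)
        }
        kwargs = {k: v for k, v in kwargs.items() if k not in split_inputs}
        results = []
        for split_input in zip(*split_inputs.values()):
            inputs = dict(zip(split_inputs.keys(), split_input))
            inputs.update(kwargs)
            results.append(self.module(*args, **inputs))
        # Assumes the wrapped module preserves the split dimension.
        return torch.cat(results, dim=self.split_dim)
```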

Member:

Sorry, okay. I am assuming that is a reasonable default to choose? I was wondering if it could make sense to let users choose the inputs they want to split.

Member Author:

I would say let's keep it in mind to allow users more control over this, but for now let's keep the scope of changes minimal. I would like to experiment with FreeNoise for CogVideoX as discussed internally, and so would like to get this in soon :)

Member:

Okay, then we could add a comment before the blocks that could be configured and revisit those if needed?

Member Author:

Thanks, updated. WDYT?

Comment on lines +507 to +517
if getattr(block, "motion_modules", None) is not None:
self._enable_split_inference_motion_modules_(block.motion_modules, spatial_split_size)
if getattr(block, "attentions", None) is not None:
self._enable_split_inference_attentions_(block.attentions, temporal_split_size)
if getattr(block, "resnets", None) is not None:
self._enable_split_inference_resnets_(block.resnets, temporal_split_size)
if getattr(block, "downsamplers", None) is not None:
self._enable_split_inference_samplers_(block.downsamplers, temporal_split_size)
if getattr(block, "upsamplers", None) is not None:
self._enable_split_inference_samplers_(block.upsamplers, temporal_split_size)

Member:

Same question. Should attentions, resnets, etc. be configurable or not?

Member Author:

Not sure I understand this comment either. Basically, we're going to be splitting across the batch dimension for these layers, based on the chosen spatial_split_size and temporal_split_size values.
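
For illustration, a hedged usage sketch; the checkpoints and split sizes are placeholders, and enable_free_noise_split_inference is assumed to be the public entry point this PR adds:

```python
import torch
from diffusers import AnimateDiffPipeline, MotionAdapter

# Illustrative checkpoints; any AnimateDiff-compatible pair should work.
adapter = MotionAdapter.from_pretrained(
    "guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16
)
pipe = AnimateDiffPipeline.from_pretrained(
    "emilianJR/epiCRealism", motion_adapter=adapter, torch_dtype=torch.float16
).to("cuda")

pipe.enable_free_noise(context_length=16, context_stride=4)
# Smaller split sizes lower peak memory at the cost of speed (values assumed).
pipe.enable_free_noise_split_inference(
    spatial_split_size=256, temporal_split_size=16
)
```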

Member:

Similarly.

> Basically, we're going to be splitting across the batch dimension for these layers, based on the chosen spatial_split_size and temporal_split_size values.

Could it make sense to let users choose the kinds of layers to apply splitting to? Perhaps we default to all of them (attentions, motion_modules, resnets, downsamplers, upsamplers), or not really?

src/diffusers/pipelines/free_noise_utils.py (outdated; resolved)
Comment on lines +1107 to +1115
hidden_states = torch.cat(
    [
        torch.where(num_times_split > 0, accumulated_split / num_times_split, accumulated_split)
        for accumulated_split, num_times_split in zip(
            accumulated_values.split(self.context_length, dim=1),
            num_times_accumulated.split(self.context_length, dim=1),
        )
    ],
    dim=1,
Member:

So, this seems to be a form of chunking?

Member Author:

Yep. At some point while lowering the peaks in the memory traces, torch.where became the bottleneck. This was actually first noticed by @DN6, so credits to him.

@sayakpaul (Member), Sep 5, 2024:

Hmm, do we know the situations in which torch.where() leads to spikes? It seems a little weird to me, honestly, because native conditionals like torch.where() are supposed to be more efficient.

Member Author:

I think the spike we see is due to tensors being copied. The intermediate dimensions for attention get large when generating many frames (say, 200+). We could do something different here too - I just did what seemed like the easiest thing, as these changes were made while I was trying different things in quick succession to golf down the memory spikes.
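
A toy illustration of why the split helps, with hypothetical shapes: the one-shot torch.where holds a full-size temporary for the division alongside the full-size output, while the split variant in the snippet above only ever holds context-length-wide temporaries before the final concatenation:

```python
import torch

context_length = 16
accumulated_values = torch.randn(2, 256, 64, 1280)            # many frames
num_times_accumulated = torch.randint(1, 5, (2, 256, 1, 1)).float()

# One-shot: the division materializes a tensor the size of
# accumulated_values, held alive while torch.where builds the output.
full = torch.where(
    num_times_accumulated > 0,
    accumulated_values / num_times_accumulated,
    accumulated_values,
)

# Split: each division/select temporary is only context_length frames wide.
split = torch.cat(
    [
        torch.where(n > 0, a / n, a)
        for a, n in zip(
            accumulated_values.split(context_length, dim=1),
            num_times_accumulated.split(context_length, dim=1),
        )
    ],
    dim=1,
)
assert torch.allclose(full, split)
```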

Member:

Ah, cool. Let's perhaps make a note of this to revisit later? At least this way, we are aware.

Member Author:

Alright, made a note. LMK if any further changes are needed on this :)

@a-r-r-o-w (Member Author):

@sayakpaul WDYT about the explanation of SplitInferenceModule now?

@a-r-r-o-w a-r-r-o-w requested a review from DN6 September 5, 2024 10:04
@DN6 (Collaborator) left a comment:

PR looks really good 👍🏽 Could you just add a fast GPU test to verify that the split-inference outputs and the normal outputs are the same?
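
A hedged sketch of what such an equivalence test could look like (the helper name, tolerance, and inputs are assumptions; pipe_inputs should carry a fixed generator so both runs are deterministic):

```python
import torch

@torch.no_grad()
def test_free_noise_split_inference_matches_baseline(pipe, pipe_inputs):
    # Baseline: FreeNoise without split inference.
    pipe.enable_free_noise()
    baseline = pipe(**pipe_inputs, output_type="pt").frames

    # Same inputs with split inference enabled; outputs should match.
    pipe.enable_free_noise_split_inference(
        spatial_split_size=256, temporal_split_size=16
    )
    split = pipe(**pipe_inputs, output_type="pt").frames

    assert torch.allclose(baseline, split, atol=1e-4)
```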

@@ -70,6 +168,9 @@ def _enable_free_noise_in_block(self, block: Union[CrossAttnDownBlockMotion, Dow
motion_module.transformer_blocks[i].load_state_dict(
    basic_transfomer_block.state_dict(), strict=True
)
motion_module.transformer_blocks[i].set_chunk_feed_forward(
Collaborator:

Do we always need chunked feed-forward set when enabling FreeNoise? Might be overkill, no?

Member Author:

This is only there to carry forward the chunked feed-forward behaviour if and only if it was already enabled in the BasicTransformerBlock. Basically, if it was not enabled in the BTB, motion_module.transformer_blocks[i]._chunk_size would be None, leading to the default behaviour of no chunking. If it was enabled in the BTB, it carries forward by default to the FreeNoiseTransformerBlock.
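
A minimal sketch of the chunked feed-forward dispatch being carried forward here, assuming helper names along the lines of diffusers' internals (not necessarily the exact implementation):

```python
import torch

def _chunked_feed_forward(ff, hidden_states, chunk_dim, chunk_size):
    # chunk_size is None -> the default behaviour: one full-width call.
    if chunk_size is None:
        return ff(hidden_states)
    # Otherwise run the FF on chunk_size-sized slices and re-concatenate.
    return torch.cat(
        [ff(chunk) for chunk in hidden_states.split(chunk_size, dim=chunk_dim)],
        dim=chunk_dim,
    )
```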

    input_kwargs_to_split: List[str] = ["hidden_states"],
) -> None:
    super().__init__()

Collaborator:

I would also add a docstring here to explain the init arguments. Maybe the workflow example in forward can be moved up here too?

Member Author:

Sounds good!

Comment on lines +113 to +115
for split_input in zip(*split_inputs.values()):
    inputs = dict(zip(split_inputs.keys(), split_input))
    inputs.update(kwargs)
Collaborator:

Clean 😎

Member Author:

Not too proud of this one 😬 but hey, I'm not afraid of not understanding what sorcery is going on here two weeks later.
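
For whoever decodes this later, a toy run of the same pattern (values hypothetical):

```python
import torch

split_inputs = {
    "hidden_states": torch.arange(6).reshape(3, 2).split(1, dim=0),
    "encoder_hidden_states": torch.arange(6, 12).reshape(3, 2).split(1, dim=0),
}
kwargs = {"attention_mask": None}  # non-split kwargs pass through untouched

# zip(*values) yields the i-th chunk of every key at step i; zipping back
# with the keys rebuilds a per-chunk kwargs dict for the wrapped module.
for split_input in zip(*split_inputs.values()):
    inputs = dict(zip(split_inputs.keys(), split_input))
    inputs.update(kwargs)
    print({k: (v.shape if torch.is_tensor(v) else v) for k, v in inputs.items()})
```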


@a-r-r-o-w a-r-r-o-w merged commit 6dfa499 into main Sep 6, 2024
18 checks passed
@a-r-r-o-w a-r-r-o-w deleted the animatediff/freenoise-memory-improvements branch September 6, 2024 07:30
sayakpaul pushed a commit that referenced this pull request Sep 6, 2024

[core] Freenoise memory improvements (#9262)

* update

* implement prompt interpolation

* make style

* resnet memory optimizations

* more memory optimizations; todo: refactor

* update

* update animatediff controlnet with latest changes

* refactor chunked inference changes

* remove print statements

* update

* chunk -> split

* remove changes from incorrect conflict resolution

* remove changes from incorrect conflict resolution

* add explanation of SplitInferenceModule

* update docs

* Revert "update docs"

This reverts commit c55a50a.

* update docstring for freenoise split inference

* apply suggestions from review

* add tests

* apply suggestions from review

sayakpaul added a commit that referenced this pull request Oct 21, 2024