[core] Freenoise memory improvements #9262
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
I was not requested for a review but still did to learn some things. Thanks, Aryan!
```python
) -> None:
    for i in range(len(attentions)):
        attentions[i] = SplitInferenceModule(
            attentions[i], temporal_split_size, 0, ["hidden_states", "encoder_hidden_states"]
```
Should ["hidden_states", "encoder_hidden_states"]
not be configurable or not really?
Sorry, I don't understand the comment very well. `SplitInferenceModule` sets `input_kwargs_to_split` to `["hidden_states"]` by default if no parameter is passed. I want both `hidden_states` and `encoder_hidden_states` to be split based on `split_size` here.
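For context, here is a condensed sketch of what `SplitInferenceModule` does, simplified from the PR (the real module also handles non-tensor inputs and modules that return tuples):

```python
import torch
import torch.nn as nn
from typing import List

class SplitInferenceModule(nn.Module):
    """Runs `module` on chunks of the listed keyword inputs to lower peak memory."""

    def __init__(
        self,
        module: nn.Module,
        split_size: int = 1,
        split_dim: int = 0,
        input_kwargs_to_split: List[str] = ["hidden_states"],
    ) -> None:
        super().__init__()
        self.module = module
        self.split_size = split_size
        self.split_dim = split_dim
        self.input_kwargs_to_split = set(input_kwargs_to_split)

    def forward(self, *args, **kwargs):
        # Split only the named tensor kwargs; everything else is passed through as-is.
        split_inputs = {
            key: torch.split(value, self.split_size, dim=self.split_dim)
            for key, value in kwargs.items()
            if key in self.input_kwargs_to_split and torch.is_tensor(value)
        }
        kwargs = {key: value for key, value in kwargs.items() if key not in split_inputs}

        results = []
        for split_input in zip(*split_inputs.values()):
            # One aligned chunk per split key, plus the shared (unsplit) kwargs.
            inputs = dict(zip(split_inputs.keys(), split_input))
            inputs.update(kwargs)
            results.append(self.module(*args, **inputs))
        return torch.cat(results, dim=self.split_dim)
```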
Sorry. Okay. I am assuming that is a reasonable default to choose? I was wondering if it could make sense to let the users choose the inputs they wanna split?
I would say let's keep it in mind to allow users to have more control over this, but for now let's keep the scope of changes minimal. I would like to experiment with FreeNoise for CogVideoX as discussed internally, and so would like to get this in soon :)
Okay, then we could add a comment before the blocks that could be configured, and revisit those if needed?
Thanks, updated. WDYT?
```python
if getattr(block, "motion_modules", None) is not None:
    self._enable_split_inference_motion_modules_(block.motion_modules, spatial_split_size)
if getattr(block, "attentions", None) is not None:
    self._enable_split_inference_attentions_(block.attentions, temporal_split_size)
if getattr(block, "resnets", None) is not None:
    self._enable_split_inference_resnets_(block.resnets, temporal_split_size)
if getattr(block, "downsamplers", None) is not None:
    self._enable_split_inference_samplers_(block.downsamplers, temporal_split_size)
if getattr(block, "upsamplers", None) is not None:
    self._enable_split_inference_samplers_(block.upsamplers, temporal_split_size)
```
Same. Should `attentions`, `resnets`, etc. be configurable, or not really?
Not sure I understand this comment either. Basically, we're going to be splitting across the batch dimension for the layers based on the chosen `spatial_split_size` and `temporal_split_size` values.
Similarly.

> Basically, we're going to be splitting across the batch dimension for the layers based on the chosen spatial_split_size and temporal_split_size values

Could it make sense to let the users choose the kinds of layers they want to apply splitting to? Perhaps we default to all (`attentions`, `motion_modules`, `resnets`, `downsamplers`, `upsamplers`), or not really?
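To illustrate the suggestion, a hypothetical sketch (the `enable_split_inference` function and `DEFAULT_SPLIT_LAYERS` constant are made-up names, not part of the PR) where the layer families are user-selectable and default to all of them:

```python
# Hypothetical: layer families that get wrapped by default.
DEFAULT_SPLIT_LAYERS = ("motion_modules", "attentions", "resnets", "downsamplers", "upsamplers")

def enable_split_inference(block, split_size: int, split_layers=DEFAULT_SPLIT_LAYERS) -> None:
    for name in split_layers:
        layers = getattr(block, name, None)
        if layers is None:
            continue
        for i in range(len(layers)):
            # Wrap each submodule so its batch dimension is processed in chunks.
            layers[i] = SplitInferenceModule(layers[i], split_size, split_dim=0)
```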
```python
hidden_states = torch.cat(
    [
        torch.where(num_times_split > 0, accumulated_split / num_times_split, accumulated_split)
        for accumulated_split, num_times_split in zip(
            accumulated_values.split(self.context_length, dim=1),
            num_times_accumulated.split(self.context_length, dim=1),
        )
    ],
    dim=1,
```
So, this seems to be a form of chunking?
Yep. At some point while lowering the peaks in the memory traces, `torch.where` became the bottleneck. This was actually first noticed by @DN6, so credit to him.
Hmm, do we know the situations where `torch.where()` leads to spikes? Seems a little weird to me honestly because native conditionals like `torch.where()` are supposed to be more efficient.
I think the spike that we see is due to tensors being copied. The intermediate dimensions for attention get large when generating many frames (let's say, 200+) here. We could do something different here too - I just did what seemed like the easiest thing, as these changes were made while I was trying out different things in quick succession to golf down the memory spikes.
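Roughly, the change trades one full-size `torch.where` for several window-sized ones. A sketch of the before/after, assuming `[batch, frames, channels]`-shaped accumulators:

```python
# Before: one torch.where over the full accumulators materializes full-size
# intermediates (the mask and the division result), which spikes for long videos.
hidden_states = torch.where(
    num_times_accumulated > 0,
    accumulated_values / num_times_accumulated,
    accumulated_values,
)

# After: walk the frame axis in context_length windows, so each intermediate
# is at most [batch, context_length, channels] before the final concat.
hidden_states = torch.cat(
    [
        torch.where(counts > 0, values / counts, values)
        for values, counts in zip(
            accumulated_values.split(context_length, dim=1),
            num_times_accumulated.split(context_length, dim=1),
        )
    ],
    dim=1,
)
```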
Ah cool. Let's perhaps make a note of this to revisit later? At least this way, we are aware?
Alright, made a note. LMK if any further changes are needed on this :)
@sayakpaul WDYT about the explanation of `SplitInferenceModule` now?
PR looks really good 👍🏽 Could you just add a fast GPU test to verify that the split inference outputs and normal outputs are the same?
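A sketch of what such a test might look like - the harness helpers (`get_dummy_components`, `get_dummy_inputs`, `torch_device`) follow the usual diffusers test layout, and the split sizes and tolerance are placeholder assumptions:

```python
import numpy as np

from diffusers.utils.testing_utils import torch_device

def test_free_noise_split_inference(self):
    # Build the pipeline and inputs the way other AnimateDiff tests do.
    components = self.get_dummy_components()
    pipe = self.pipeline_class(**components).to(torch_device)
    pipe.enable_free_noise(8, 4)

    inputs_normal = self.get_dummy_inputs(torch_device)
    frames_normal = pipe(**inputs_normal).frames[0]

    # Enable split inference and re-run with identical inputs.
    pipe.enable_free_noise_split_inference(spatial_split_size=4, temporal_split_size=4)
    inputs_split = self.get_dummy_inputs(torch_device)
    frames_split = pipe(**inputs_split).frames[0]

    # Splitting only changes how batches are processed, not the math,
    # so the outputs should agree to within numerical noise.
    max_diff = np.abs(np.asarray(frames_normal) - np.asarray(frames_split)).max()
    self.assertLess(max_diff, 1e-4)
```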
```diff
@@ -70,6 +168,9 @@ def _enable_free_noise_in_block(self, block: Union[CrossAttnDownBlockMotion, Dow
    motion_module.transformer_blocks[i].load_state_dict(
        basic_transfomer_block.state_dict(), strict=True
    )
    motion_module.transformer_blocks[i].set_chunk_feed_forward(
```
Do we always need chunked feed-forward set when enabling FreeNoise? Might be overkill, no?
This is only there to carry forward the chunked FF behaviour if and only if it was already enabled in the `BasicTransformerBlock`. Basically, if it was not enabled in the BTB, `motion_module.transformer_blocks[i]._chunk_size` would be `None`, leading to the default behaviour of no chunking. If it was enabled in the BTB, it would carry forward by default to the `FreeNoiseTransformerBlock`.
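In other words, a simplified sketch of the carry-over being described:

```python
# The FreeNoise block simply inherits the BasicTransformerBlock's chunked
# feed-forward configuration. If chunking was never enabled there,
# _chunk_size is None and the call amounts to a no-op.
motion_module.transformer_blocks[i].set_chunk_feed_forward(
    basic_transfomer_block._chunk_size,  # None => default behaviour, no chunking
    basic_transfomer_block._chunk_dim,
)
```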
```python
    input_kwargs_to_split: List[str] = ["hidden_states"],
) -> None:
    super().__init__()
```
Would also add a docstring here to explain the init arguments. Maybe the workflow example in forward can be moved up here too?
Sounds good!
```python
for split_input in zip(*split_inputs.values()):
    inputs = dict(zip(split_inputs.keys(), split_input))
    inputs.update(kwargs)
```
Clean 😎
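For anyone reading along, a tiny standalone demo of this pattern (the shapes and kwarg names are arbitrary):

```python
import torch

# Two kwargs, each pre-split into aligned chunks along the batch dimension.
split_inputs = {
    "hidden_states": torch.randn(4, 8).split(2, dim=0),           # 2 chunks of [2, 8]
    "encoder_hidden_states": torch.randn(4, 16).split(2, dim=0),  # 2 chunks of [2, 16]
}
kwargs = {"attention_mask": None}  # non-split kwargs are shared across chunks

for split_input in zip(*split_inputs.values()):
    # Rebuild one kwargs dict per chunk: the aligned chunk for each split key,
    # plus the shared (unsplit) kwargs.
    inputs = dict(zip(split_inputs.keys(), split_input))
    inputs.update(kwargs)
    print({k: (tuple(v.shape) if torch.is_tensor(v) else v) for k, v in inputs.items()})
```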
Merged with the following commits:

* update
* implement prompt interpolation
* make style
* resnet memory optimizations
* more memory optimizations; todo: refactor
* update
* update animatediff controlnet with latest changes
* refactor chunked inference changes
* remove print statements
* update
* chunk -> split
* remove changes from incorrect conflict resolution
* remove changes from incorrect conflict resolution
* add explanation of SplitInferenceModule
* update docs
* Revert "update docs" (reverts commit c55a50a)
* update docstring for freenoise split inference
* apply suggestions from review
* add tests
* apply suggestions from review
What does this PR do?
Memory improvements from #9231. The previous PR had too many changes, so it has been split up to make review easier. This PR contains just the FreeNoise memory improvements, whereas the previous one contains the prompt travel support. This PR will be ready for review after #9231 has been merged and this branch has been rebased.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@DN6