feat: allow flux transformer to be sharded during inference #9159

sayakpaul · 2024-08-12T11:45:04Z

What does this PR do?

Adds support to shard the Flux transformer across multiple devices.

Here's how to run it:

generate_embeddings.py

from diffusers import FluxPipeline
import torch

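# Load only the text encoders and tokenizers; the transformer and VAE are
# not needed to compute the prompt embeddings.
pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/flux.1-dev", transformer=None, vae=None, torch_dtype=torch.bfloat16
)
pipeline.enable_model_cpu_offload()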

prompt = "a cute fish holding a sign saying 'hello world!'"

with torch.no_grad():
    prompt_embeds, pooled_prompt_embeds, text_ids = pipeline.encode_prompt(
        prompt, prompt_2=None, max_sequence_length=512, num_images_per_prompt=4
    )

torch.save(prompt_embeds, "prompt_embeds.pt")
torch.save(pooled_prompt_embeds, "pooled_prompt_embeds.pt")
torch.save(text_ids, "text_ids.pt")

This generates the prompt embeddings and serializes them to disk.

run_denoising_loop.py

from diffusers import FluxTransformer2DModel, FluxPipeline
import torch 

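# Cap each GPU at 16GB so that the transformer has to be split across both devices.
max_memory = {0: "16GB", 1: "16GB"}
ckpt_id = "black-forest-labs/flux.1-dev"
# device_map="auto" distributes the transformer's weights across the available GPUs.
model = FluxTransformer2DModel.from_pretrained(
    ckpt_id,
    subfolder="transformer",
    device_map="auto",
    max_memory=max_memory,
    torch_dtype=torch.bfloat16,
)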
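# Skip the text encoders, tokenizers, and VAE; only the sharded transformer
# is needed for the denoising loop.
pipeline = FluxPipeline.from_pretrained(
    ckpt_id,
    transformer=model,
    text_encoder=None,
    text_encoder_2=None,
    tokenizer=None,
    tokenizer_2=None,
    vae=None,
    torch_dtype=torch.bfloat16,
)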

height, width = 768, 1360
latents = pipeline(
    prompt_embeds=torch.load("prompt_embeds.pt"),
    pooled_prompt_embeds=torch.load("pooled_prompt_embeds.pt"),
    num_inference_steps=50,
    guidance_scale=3.5,
    height=height,
    width=width,
    output_type="latent",
).images
print(f"{latents.shape=}")
torch.save(latents, "latents.pt")

This does the following:

Distributes the Flux transformer across two GPUs (capping each at 16GB to mimic smaller cards).
Runs the denoising loop with the serialized text embeddings.
Serializes the final latents for decoding.
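
To make the effect of the max_memory budget more concrete, here is a minimal sketch of a greedy, in-order block placement under per-device limits. It is a simplified illustration only and is not the algorithm that device_map="auto" actually runs; the helper name assign_blocks and the block sizes are hypothetical.

```python
def assign_blocks(block_sizes_gb, max_memory_gb):
    """Greedily place blocks, in order, on the first device with room left.

    Hypothetical helper for illustration; real placement logic also has to
    account for buffers, tied weights, and modules that must not be split.
    """
    devices = list(max_memory_gb)  # keep the order the budgets were given in
    free = dict(max_memory_gb)     # remaining budget per device, in GB
    placement = {}
    current = 0
    for name, size in block_sizes_gb.items():
        # Advance to the next device until the block fits.
        while current < len(devices) and free[devices[current]] < size:
            current += 1
        if current == len(devices):
            raise MemoryError(f"{name} does not fit on any remaining device")
        placement[name] = devices[current]
        free[devices[current]] -= size
    return placement


# Made-up sizes: six blocks of 4GB each (24GB total), two 16GB budgets.
blocks = {f"block_{i}": 4.0 for i in range(6)}
placement = assign_blocks(blocks, {0: 16.0, 1: 16.0})
print(placement)
```

With these made-up numbers the first four blocks land on device 0 and the remaining two spill over to device 1, which is the kind of split the 16GB caps are meant to force.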

decode_latents.py

from diffusers import FluxPipeline, AutoencoderKL
from diffusers.image_processor import VaeImageProcessor
import torch

vae = AutoencoderKL.from_pretrained("black-forest-labs/flux.1-dev", subfolder="vae", torch_dtype=torch.bfloat16).to(
    "cuda"
)

latents = torch.load("latents.pt")
height, width = 768, 1360
vae_scale_factor = 2 ** (len(vae.config.block_out_channels))
image_processor = VaeImageProcessor(vae_scale_factor=vae_scale_factor)

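# Unpack the packed latents into (batch, channels, height, width), then undo
# the VAE scaling and shift before decoding.
latents = FluxPipeline._unpack_latents(latents, height, width, vae_scale_factor)
latents = (latents / vae.config.scaling_factor) + vae.config.shift_factor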

with torch.no_grad():
    image = vae.decode(latents, return_dict=False)[0]
image = image_processor.postprocess(image, output_type="pil")
image[0].save("image.png")

We get:

[Image: output generated for the prompt "a cute fish holding a sign saying 'hello world!'"]
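
For reference, the tensor shapes behind the unpacking step in decode_latents.py can be traced with plain arithmetic. This is a sketch under stated assumptions: the VAE config has four entries in block_out_channels (so vae_scale_factor evaluates to 16), the latents have 16 channels, the latents are packed as 2x2 patches, and the batch size is 4 because the embeddings were created with num_images_per_prompt=4. The helper name latent_shapes is hypothetical; the numbers are worth checking against the latents.shape printed by run_denoising_loop.py.

```python
def latent_shapes(batch, height, width, vae_scale_factor=16, latent_channels=16):
    """Return (packed, unpacked) latent shapes for a given image size.

    Assumes 2x2 patch packing, so each packed token holds
    latent_channels * 4 values. All defaults are assumptions.
    """
    h = height // vae_scale_factor
    w = width // vae_scale_factor
    packed = (batch, h * w, latent_channels * 4)        # (batch, tokens, features)
    unpacked = (batch, latent_channels, h * 2, w * 2)   # (batch, C, H, W)
    return packed, unpacked


packed, unpacked = latent_shapes(batch=4, height=768, width=1360)
print(packed)    # (4, 4080, 64)
print(unpacked)  # (4, 16, 96, 170)
```

Under these assumptions the unpacked spatial size (96 x 170) is one eighth of the requested 768 x 1360 image, which is consistent with the VAE decoder upsampling by a factor of 8.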

The tests were run with the following command:

CUDA_VISIBLE_DEVICES=0,1 pytest tests/models/transformers/test_models_transformer_flux.py -k "offload"

HuggingFaceDocBuilderDev · 2024-08-12T11:50:45Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

SunMarc

Nice! You mean CUDA_VISIBLE_DEVICES=0,1 pytest tests/models/transformers/test_models_transformer_flux.py -k "offload" instead, no? Otherwise, it only uses one GPU.

sayakpaul · 2024-08-13T15:05:01Z

@SunMarc that's right!

notdanilo · 2024-08-13T20:52:30Z

*** Looking for the merge button ***

Damn! I don't have write access.

DN6

LGTM 👍🏽

sayakpaul · 2024-08-16T04:30:48Z

Failing test is unrelated.

sayakpaul added 2 commits August 12, 2024 17:02

feat: support sharding for flux.

b516b0f

tests

704c31e

sayakpaul requested review from DN6 and SunMarc August 12, 2024 11:45

Merge branch 'main' into support-flux-sharding

50fdb50

SunMarc approved these changes Aug 13, 2024

Merge branch 'main' into support-flux-sharding

9f1553e

sayakpaul mentioned this pull request Aug 14, 2024

flux.1-dev device_map didn't work #9127

Open

Merge branch 'main' into support-flux-sharding

ae786bc

DN6 approved these changes Aug 16, 2024

Merge branch 'main' into support-flux-sharding

8d3944f

sayakpaul merged commit 39b87b1 into main Aug 16, 2024
17 of 18 checks passed

sayakpaul deleted the support-flux-sharding branch August 16, 2024 04:30

asomoza mentioned this pull request Aug 16, 2024

Problem with Flux Schnell bfloat16 multiGPU #9195

Closed
