diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml index ae45906bc3c6..57b80ca54427 100644 --- a/docs/source/en/_toctree.yml +++ b/docs/source/en/_toctree.yml @@ -36,38 +36,42 @@ title: Push files to the Hub title: Loading & Hub - sections: - - local: using-diffusers/pipeline_overview - title: Overview - local: using-diffusers/unconditional_image_generation title: Unconditional image generation - local: using-diffusers/conditional_image_generation - title: Text-to-image generation + title: Text-to-image - local: using-diffusers/img2img - title: Text-guided image-to-image + title: Image-to-image - local: using-diffusers/inpaint - title: Text-guided image-inpainting + title: Inpainting - local: using-diffusers/depth2img - title: Text-guided depth-to-image + title: Depth-to-image + title: Tasks + - sections: - local: using-diffusers/textual_inversion_inference title: Textual inversion - local: training/distributed_inference title: Distributed inference with multiple GPUs - - local: using-diffusers/distilled_sd - title: Distilled Stable Diffusion inference - local: using-diffusers/reusing_seeds title: Improve image quality with deterministic generation - local: using-diffusers/control_brightness title: Control image brightness + - local: using-diffusers/weighted_prompts + title: Prompt weighting + title: Techniques + - sections: + - local: using-diffusers/pipeline_overview + title: Overview + - local: using-diffusers/sdxl + title: Stable Diffusion XL + - local: using-diffusers/distilled_sd + title: Distilled Stable Diffusion inference - local: using-diffusers/reproducibility title: Create reproducible pipelines - local: using-diffusers/custom_pipeline_examples title: Community pipelines - local: using-diffusers/contribute_pipeline title: How to contribute a community pipeline - - local: using-diffusers/stable_diffusion_jax_how_to - title: Stable Diffusion in JAX/Flax - - local: using-diffusers/weighted_prompts - title: Prompt weighting title: Pipelines for Inference - sections: - local: training/overview @@ -105,6 +109,8 @@ title: Memory and Speed - local: optimization/torch2.0 title: Torch2.0 support + - local: using-diffusers/stable_diffusion_jax_how_to + title: Stable Diffusion in JAX/Flax - local: optimization/xformers title: xFormers - local: optimization/onnx diff --git a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md index f6585f819928..e9fc9ae09380 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md +++ b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md @@ -10,414 +10,29 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Stable diffusion XL +# Stable Diffusion XL -Stable Diffusion XL was proposed in [SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis](https://arxiv.org/abs/2307.01952) by Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, Robin Rombach +Stable Diffusion XL (SDXL) was proposed in [SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis](https://huggingface.co/papers/2307.01952) by Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. 
-The abstract of the paper is the following: +The abstract from the paper is: *We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: The increase of model parameters is mainly due to more attention blocks and a larger cross-attention context as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL shows drastically improved performance compared the previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators.* ## Tips -- Stable Diffusion XL works especially well with images between 768 and 1024. -- Stable Diffusion XL can pass a different prompt for each of the text encoders it was trained on as shown below. We can even pass different parts of the same prompt to the text encoders. -- Stable Diffusion XL output image can be improved by making use of a refiner as shown below. -- One can make use of `negative_original_size`, `negative_crops_coords_top_left`, and `negative_target_size` to influence the generation process. - -### Available checkpoints: - -- *Text-to-Image (1024x1024 resolution)*: [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) with [`StableDiffusionXLPipeline`] -- *Image-to-Image / Refiner (1024x1024 resolution)*: [stabilityai/stable-diffusion-xl-refiner-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0) with [`StableDiffusionXLImg2ImgPipeline`] - -## Usage Example - -Before using SDXL make sure to have `transformers`, `accelerate`, `safetensors` and `invisible_watermark` installed. -You can install the libraries as follows: - -``` -pip install transformers -pip install accelerate -pip install safetensors -``` - -### Watermarker - -We recommend to add an invisible watermark to images generating by Stable Diffusion XL, this can help with identifying if an image is machine-synthesised for downstream applications. To do so, please install -the [invisible-watermark library](https://pypi.org/project/invisible-watermark/) via: - -``` -pip install invisible-watermark>=0.2.0 -``` - -If the `invisible-watermark` library is installed the watermarker will be used **by default**. - -If you have other provisions for generating or deploying images safely, you can disable the watermarker as follows: - -```py -pipe = StableDiffusionXLPipeline.from_pretrained(..., add_watermarker=False) -``` - -### Text-to-Image - -You can use SDXL as follows for *text-to-image*: - -```py -from diffusers import StableDiffusionXLPipeline -import torch - -pipe = StableDiffusionXLPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -) -pipe.to("cuda") - -prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" -image = pipe(prompt=prompt).images[0] -``` - -You can additionally pass negative conditions about an image's size and position to avoid undesirable cropping behavior in the generated image, and improve image resolution. 
Let's take an example: - -```python -from diffusers import StableDiffusionXLPipeline -import torch - -pipe = StableDiffusionXLPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -) -pipe.to("cuda") - -prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" -image = pipe( - prompt=prompt, - negative_original_size=(512, 512), - negative_crops_coords_top_left=(0, 0), - negative_target_size=(1024, 1024), -).images[0] -``` - -Here is a comparative example that shows the influence of using three `negative_original_size`s of -(128, 128), (256, 256), and (512, 512) respectively: - -![](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/sd_xl/negative_conditions.png) - - - -One can use these negative conditions in the other SDXL pipelines ([Image-To-Image](#image-to-image), [Inpainting](#inpainting), [ControlNet](../controlnet_sdxl.md)) too! - - - -### Image-to-image - -You can use SDXL as follows for *image-to-image*: - -```py -import torch -from diffusers import StableDiffusionXLImg2ImgPipeline -from diffusers.utils import load_image - -pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -) -pipe = pipe.to("cuda") -url = "https://huggingface.co/datasets/patrickvonplaten/images/resolve/main/aa_xl/000000009.png" - -init_image = load_image(url).convert("RGB") -prompt = "a photo of an astronaut riding a horse on mars" -image = pipe(prompt, image=init_image).images[0] -``` - -### Inpainting - -You can use SDXL as follows for *inpainting* - -```py -import torch -from diffusers import StableDiffusionXLInpaintPipeline -from diffusers.utils import load_image - -pipe = StableDiffusionXLInpaintPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -) -pipe.to("cuda") - -img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png" -mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png" - -init_image = load_image(img_url).convert("RGB") -mask_image = load_image(mask_url).convert("RGB") - -prompt = "A majestic tiger sitting on a bench" -image = pipe(prompt=prompt, image=init_image, mask_image=mask_image, num_inference_steps=50, strength=0.80).images[0] -``` - -### Refining the image output - -In addition to the [base model checkpoint](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0), -StableDiffusion-XL also includes a [refiner checkpoint](huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0) -that is specialized in denoising low-noise stage images to generate images of improved high-frequency quality. -This refiner checkpoint can be used as a "second-step" pipeline after having run the base checkpoint to improve -image quality. - -When using the refiner, one can easily -- 1.) employ the base model and refiner as an *Ensemble of Expert Denoisers* as first proposed in [eDiff-I](https://research.nvidia.com/labs/dir/eDiff-I/) or -- 2.) simply run the refiner in [SDEdit](https://arxiv.org/abs/2108.01073) fashion after the base model. 
- -**Note**: The idea of using SD-XL base & refiner as an ensemble of experts was first brought forward by -a couple community contributors which also helped shape the following `diffusers` implementation, namely: -- [SytanSD](https://github.com/SytanSD) -- [bghira](https://github.com/bghira) -- [Birch-san](https://github.com/Birch-san) -- [AmericanPresidentJimmyCarter](https://github.com/AmericanPresidentJimmyCarter) - -#### 1.) Ensemble of Expert Denoisers - -When using the base and refiner model as an ensemble of expert of denoisers, the base model should serve as the -expert for the high-noise diffusion stage and the refiner serves as the expert for the low-noise diffusion stage. - -The advantage of 1.) over 2.) is that it requires less overall denoising steps and therefore should be significantly -faster. The drawback is that one cannot really inspect the output of the base model; it will still be heavily denoised. - -To use the base model and refiner as an ensemble of expert denoisers, make sure to define the span -of timesteps which should be run through the high-noise denoising stage (*i.e.* the base model) and the low-noise -denoising stage (*i.e.* the refiner model) respectively. We can set the intervals using the [`denoising_end`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.denoising_end) of the base model -and [`denoising_start`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLImg2ImgPipeline.__call__.denoising_start) of the refiner model. - -For both `denoising_end` and `denoising_start` a float value between 0 and 1 should be passed. -When passed, the end and start of denoising will be defined by proportions of discrete timesteps as -defined by the model schedule. -Note that this will override `strength` if it is also declared, since the number of denoising steps -is determined by the discrete timesteps the model was trained on and the declared fractional cutoff. - -Let's look at an example. -First, we import the two pipelines. Since the text encoders and variational autoencoder are the same -you don't have to load those again for the refiner. - -```py -from diffusers import DiffusionPipeline -import torch - -base = DiffusionPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -) -base.to("cuda") - -refiner = DiffusionPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-refiner-1.0", - text_encoder_2=base.text_encoder_2, - vae=base.vae, - torch_dtype=torch.float16, - use_safetensors=True, - variant="fp16", -) -refiner.to("cuda") -``` - -Now we define the number of inference steps and the point at which the model shall be run through the -high-noise denoising stage (*i.e.* the base model). - -```py -n_steps = 40 -high_noise_frac = 0.8 -``` - -Stable Diffusion XL base is trained on timesteps 0-999 and Stable Diffusion XL refiner is finetuned -from the base model on low noise timesteps 0-199 inclusive, so we use the base model for the first -800 timesteps (high noise) and the refiner for the last 200 timesteps (low noise). Hence, `high_noise_frac` -is set to 0.8, so that all steps 200-999 (the first 80% of denoising timesteps) are performed by the -base model and steps 0-199 (the last 20% of denoising timesteps) are performed by the refiner model. 
- -Remember, the denoising process starts at **high value** (high noise) timesteps and ends at -**low value** (low noise) timesteps. - -Let's run the two pipelines now. Make sure to set `denoising_end` and -`denoising_start` to the same values and keep `num_inference_steps` constant. Also remember that -the output of the base model should be in latent space: - -```py -prompt = "A majestic lion jumping from a big stone at night" - -image = base( - prompt=prompt, - num_inference_steps=n_steps, - denoising_end=high_noise_frac, - output_type="latent", -).images -image = refiner( - prompt=prompt, - num_inference_steps=n_steps, - denoising_start=high_noise_frac, - image=image, -).images[0] -``` - -Let's have a look at the images - -| Original Image | Ensemble of Denoisers Experts | -|---|---| -| ![lion_base_timesteps](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lion_base.png) | ![lion_refined_timesteps](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lion_refined.png) - -If we would have just run the base model on the same 40 steps, the image would have been arguably less detailed (e.g. the lion eyes and nose): +- SDXL works especially well with images between 768 and 1024. +- SDXL can pass a different prompt for each of the text encoders it was trained on. We can even pass different parts of the same prompt to the text encoders. +- SDXL output images can be improved by making use of a refiner model in an image-to-image setting. +- SDXL offers `negative_original_size`, `negative_crops_coords_top_left`, and `negative_target_size` to negatively condition the model on image resolution and cropping parameters. -The ensemble-of-experts method works well on all available schedulers! - - - -#### 2.) Refining the image output from fully denoised base image - -In standard [`StableDiffusionImg2ImgPipeline`]-fashion, the fully-denoised image generated of the base model -can be further improved using the [refiner checkpoint](huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0). - -For this, you simply run the refiner as a normal image-to-image pipeline after the "base" text-to-image -pipeline. You can leave the outputs of the base model in latent space. +To learn how to use SDXL for various tasks, how to optimize performance, and other usage examples, take a look at the [Stable Diffusion XL](/using-diffusers/sdxl) guide. -```py -from diffusers import DiffusionPipeline -import torch - -pipe = DiffusionPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -) -pipe.to("cuda") - -refiner = DiffusionPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-refiner-1.0", - text_encoder_2=pipe.text_encoder_2, - vae=pipe.vae, - torch_dtype=torch.float16, - use_safetensors=True, - variant="fp16", -) -refiner.to("cuda") - -prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" - -image = pipe(prompt=prompt, output_type="latent" if use_refiner else "pil").images[0] -image = refiner(prompt=prompt, image=image[None, :]).images[0] -``` - -| Original Image | Refined Image | -|---|---| -| ![](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/sd_xl/init_image.png) | ![](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/sd_xl/refined_image.png) | - - - -The refiner can also very well be used in an in-painting setting. 
To do so just make - sure you use the [`StableDiffusionXLInpaintPipeline`] classes as shown below +Check out the [Stability AI](https://huggingface.co/stabilityai) Hub organization for the official base and refiner model checkpoints! -To use the refiner for inpainting in the Ensemble of Expert Denoisers setting you can do the following: - -```py -from diffusers import StableDiffusionXLInpaintPipeline -from diffusers.utils import load_image - -pipe = StableDiffusionXLInpaintPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -) -pipe.to("cuda") - -refiner = StableDiffusionXLInpaintPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-refiner-1.0", - text_encoder_2=pipe.text_encoder_2, - vae=pipe.vae, - torch_dtype=torch.float16, - use_safetensors=True, - variant="fp16", -) -refiner.to("cuda") - -img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png" -mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png" - -init_image = load_image(img_url).convert("RGB") -mask_image = load_image(mask_url).convert("RGB") - -prompt = "A majestic tiger sitting on a bench" -num_inference_steps = 75 -high_noise_frac = 0.7 - -image = pipe( - prompt=prompt, - image=init_image, - mask_image=mask_image, - num_inference_steps=num_inference_steps, - denoising_start=high_noise_frac, - output_type="latent", -).images -image = refiner( - prompt=prompt, - image=image, - mask_image=mask_image, - num_inference_steps=num_inference_steps, - denoising_start=high_noise_frac, -).images[0] -``` - -To use the refiner for inpainting in the standard SDE-style setting, simply remove `denoising_end` and `denoising_start` and choose a smaller -number of inference steps for the refiner. - -### Loading single file checkpoints / original file format - -By making use of [`~diffusers.loaders.FromSingleFileMixin.from_single_file`] you can also load the -original file format into `diffusers`: - -```py -from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline -import torch - -pipe = StableDiffusionXLPipeline.from_single_file( - "./sd_xl_base_1.0.safetensors", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -) -pipe.to("cuda") - -refiner = StableDiffusionXLImg2ImgPipeline.from_single_file( - "./sd_xl_refiner_1.0.safetensors", torch_dtype=torch.float16, use_safetensors=True, variant="fp16" -) -refiner.to("cuda") -``` - -### Memory optimization via model offloading - -If you are seeing out-of-memory errors, we recommend making use of [`StableDiffusionXLPipeline.enable_model_cpu_offload`]. - -```diff -- pipe.to("cuda") -+ pipe.enable_model_cpu_offload() -``` - -and - -```diff -- refiner.to("cuda") -+ refiner.enable_model_cpu_offload() -``` - -### Speed-up inference with `torch.compile` - -You can speed up inference by making use of `torch.compile`. This should give you **ca.** 20% speed-up. 
- -```diff -+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) -+ refiner.unet = torch.compile(refiner.unet, mode="reduce-overhead", fullgraph=True) -``` - -### Running with `torch < 2.0` - -**Note** that if you want to run Stable Diffusion XL with `torch` < 2.0, please make sure to enable xformers -attention: - -``` -pip install xformers -``` - -```diff -+pipe.enable_xformers_memory_efficient_attention() -+refiner.enable_xformers_memory_efficient_attention() -``` - ## StableDiffusionXLPipeline [[autodoc]] StableDiffusionXLPipeline @@ -435,25 +50,3 @@ pip install xformers [[autodoc]] StableDiffusionXLInpaintPipeline - all - __call__ - -### Passing different prompts to each text-encoder - -Stable Diffusion XL was trained on two text encoders. The default behavior is to pass the same prompt to each. But it is possible to pass a different prompt for each text-encoder, as [some users](https://github.com/huggingface/diffusers/issues/4004#issuecomment-1627764201) noted that it can boost quality. -To do so, you can pass `prompt_2` and `negative_prompt_2` in addition to `prompt` and `negative_prompt`. By doing that, you will pass the original prompts and negative prompts (as in `prompt` and `negative_prompt`) to `text_encoder` (in official SDXL 0.9/1.0 that is [OpenAI CLIP-ViT/L-14](https://huggingface.co/openai/clip-vit-large-patch14)), -and `prompt_2` and `negative_prompt_2` to `text_encoder_2` (in official SDXL 0.9/1.0 that is [OpenCLIP-ViT/bigG-14](https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k)). - -```py -from diffusers import StableDiffusionXLPipeline -import torch - -pipe = StableDiffusionXLPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -) -pipe.to("cuda") - -# prompt will be passed to OAI CLIP-ViT/L-14 -prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" -# prompt_2 will be passed to OpenCLIP-ViT/bigG-14 -prompt_2 = "monet painting" -image = pipe(prompt=prompt, prompt_2=prompt_2).images[0] -``` diff --git a/docs/source/en/using-diffusers/pipeline_overview.md b/docs/source/en/using-diffusers/pipeline_overview.md index ca98fc3f4b63..4ee25b51dc6f 100644 --- a/docs/source/en/using-diffusers/pipeline_overview.md +++ b/docs/source/en/using-diffusers/pipeline_overview.md @@ -12,6 +12,6 @@ specific language governing permissions and limitations under the License. # Overview -A pipeline is an end-to-end class that provides a quick and easy way to use a diffusion system for inference by bundling independently trained models and schedulers together. Certain combinations of models and schedulers define specific pipeline types, like [`StableDiffusionPipeline`] or [`StableDiffusionControlNetPipeline`], with specific capabilities. All pipeline types inherit from the base [`DiffusionPipeline`] class; pass it any checkpoint, and it'll automatically detect the pipeline type and load the necessary components. +A pipeline is an end-to-end class that provides a quick and easy way to use a diffusion system for inference by bundling independently trained models and schedulers together. Certain combinations of models and schedulers define specific pipeline types, like [`StableDiffusionXLPipeline`] or [`StableDiffusionControlNetPipeline`], with specific capabilities. All pipeline types inherit from the base [`DiffusionPipeline`] class; pass it any checkpoint, and it'll automatically detect the pipeline type and load the necessary components. 
-This section introduces you to some of the tasks supported by our pipelines such as unconditional image generation and different techniques and variations of text-to-image generation. You'll also learn how to gain more control over the generation process by setting a seed for reproducibility and weighting prompts to adjust the influence certain words in the prompt has over the output. Finally, you'll see how you can create a community pipeline for a custom task like generating images from speech. \ No newline at end of file +This section introduces you to some of the more complex pipelines like Stable Diffusion XL, ControlNet, and DiffEdit, which require additional inputs. You'll also learn how to use a distilled version of the Stable Diffusion model to speed up inference, how to control randomness on your hardware when generating images, and how to create a community pipeline for a custom task like generating images from speech. \ No newline at end of file diff --git a/docs/source/en/using-diffusers/sdxl.md b/docs/source/en/using-diffusers/sdxl.md new file mode 100644 index 000000000000..4ca02a4cc2c5 --- /dev/null +++ b/docs/source/en/using-diffusers/sdxl.md @@ -0,0 +1,429 @@ +# Stable Diffusion XL + +[[open-in-colab]] + +[Stable Diffusion XL](https://huggingface.co/papers/2307.01952) (SDXL) is a powerful text-to-image generation model that iterates on the previous Stable Diffusion models in three key ways: + +1. the UNet is 3x larger and SDXL combines a second text encoder (OpenCLIP ViT-bigG/14) with the original text encoder to significantly increase the number of parameters +2. introduces size and crop-conditioning to preserve training data from being discarded and gain more control over how a generated image should be cropped +3. introduces a two-stage model process; the *base* model (can also be run as a standalone model) generates an image as an input to the *refiner* model which adds additional high-quality details + +This guide will show you how to use SDXL for text-to-image, image-to-image, and inpainting. + +Before you begin, make sure you have the following libraries installed: + +```py +# uncomment to install the necessary libraries in Colab +#!pip install diffusers transformers accelerate safetensors omegaconf invisible-watermark>=0.2.0 +``` + + + +We recommend installing the [invisible-watermark](https://pypi.org/project/invisible-watermark/) library to help identify images that are generated. If the invisible-watermark library is installed, it is used by default. 
To disable the watermarker:
+
+```py
+pipeline = StableDiffusionXLPipeline.from_pretrained(..., add_watermarker=False)
+```
+
+
+## Load model checkpoints
+
+Model weights may be stored in separate subfolders on the Hub or locally, in which case, you should use the [`~StableDiffusionXLPipeline.from_pretrained`] method:
+
+```py
+from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline
+import torch
+
+pipeline = StableDiffusionXLPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
+).to("cuda")
+
+refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16, use_safetensors=True, variant="fp16"
+).to("cuda")
+```
+
+You can also use the [`~StableDiffusionXLPipeline.from_single_file`] method to load a model checkpoint stored in a single file format (`.ckpt` or `.safetensors`) from the Hub or locally:
+
+```py
+from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline
+import torch
+
+pipeline = StableDiffusionXLPipeline.from_single_file(
+    "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/sd_xl_base_1.0.safetensors", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
+).to("cuda")
+
+refiner = StableDiffusionXLImg2ImgPipeline.from_single_file(
+    "https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/blob/main/sd_xl_refiner_1.0.safetensors", torch_dtype=torch.float16, use_safetensors=True, variant="fp16"
+).to("cuda")
+```
+
+## Text-to-image
+
+For text-to-image, pass a text prompt:
+
+```py
+from diffusers import AutoPipelineForText2Image
+import torch
+
+pipeline_text2image = AutoPipelineForText2Image.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
+).to("cuda")
+
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+image = pipeline_text2image(prompt=prompt).images[0]
+```
+
+[image: generated image of an astronaut in a jungle]
+
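+By default, SDXL generates a 1024x1024 image. You can request a different resolution with the standard `height` and `width` arguments. The snippet below is a minimal sketch reusing the pipeline loaded above; the 768x768 resolution is only illustrative (SDXL tends to work best between 768 and 1024):
+
+```py
+# reuse the text-to-image pipeline from above; the height/width values are illustrative
+image = pipeline_text2image(
+    prompt="Astronaut in a jungle, cold color palette, muted colors, detailed, 8k",
+    height=768,
+    width=768,
+).images[0]
+```
+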
+
+## Image-to-image
+
+For image-to-image, SDXL works especially well with image sizes between 768x768 and 1024x1024. Pass an initial image and a text prompt to condition the image with:
+
+```py
+from diffusers import AutoPipelineForImage2Image
+from diffusers.utils import load_image
+
+# use from_pipe to avoid consuming additional memory when loading a checkpoint
+pipeline = AutoPipelineForImage2Image.from_pipe(pipeline_text2image).to("cuda")
+url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-img2img.png"
+
+init_image = load_image(url).convert("RGB")
+prompt = "a dog catching a frisbee in the jungle"
+image = pipeline(prompt, image=init_image, strength=0.8, guidance_scale=10.5).images[0]
+```
+
+[image: generated image of a dog catching a frisbee in a jungle]
+
+ +## Inpainting + +For inpainting, you'll need the original image and a mask of what you want to replace in the original image. Create a prompt to describe what you want to replace the masked area with. + +```py +from diffusers import AutoPipelineForInpainting +from diffusers.utils import load_image + +# use from_pipe to avoid consuming additional memory when loading a checkpoint +pipeline = AutoPipelineForInpainting.from_pipe(pipeline_text2image).to("cuda") + +img_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png" +mask_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-inpaint-mask.png" + +init_image = load_image(img_url).convert("RGB") +mask_image = load_image(mask_url).convert("RGB") + +prompt = "A deep sea diver floating" +image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, strength=0.85, guidance_scale=12.5).images[0] +``` + +
+[image: generated image of a deep sea diver in a jungle]
+
+
+## Refine image quality
+
+SDXL includes a [refiner model](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0) specialized in denoising low-noise stage images to generate higher-quality images from the base model. There are two ways to use the refiner:
+
+1. use the base and refiner model together to produce a refined image
+2. use the base model to produce an image, and subsequently use the refiner model to add more details to the image (this is how SDXL was originally trained)
+
+### Base + refiner model
+
+When you use the base and refiner model together to generate an image, this is known as an [*ensemble of expert denoisers*](https://research.nvidia.com/labs/dir/eDiff-I/). The ensemble of expert denoisers approach requires fewer overall denoising steps versus passing the base model's output to the refiner model, so it should be significantly faster to run. However, you won't be able to inspect the base model's output because it still contains a large amount of noise.
+
+As an ensemble of expert denoisers, the base model serves as the expert during the high-noise diffusion stage and the refiner model serves as the expert during the low-noise diffusion stage. Load the base and refiner model:
+
+```py
+from diffusers import DiffusionPipeline
+import torch
+
+base = DiffusionPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
+).to("cuda")
+
+refiner = DiffusionPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-refiner-1.0",
+    text_encoder_2=base.text_encoder_2,
+    vae=base.vae,
+    torch_dtype=torch.float16,
+    use_safetensors=True,
+    variant="fp16",
+).to("cuda")
+```
+
+To use this approach, you need to define the number of timesteps for each model to run through their respective stages. For the base model, this is controlled by the [`denoising_end`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.denoising_end) parameter and for the refiner model, it is controlled by the [`denoising_start`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLImg2ImgPipeline.__call__.denoising_start) parameter.
+
+The `denoising_end` and `denoising_start` parameters should each be a float between 0 and 1. These parameters are represented as a proportion of discrete timesteps as defined by the scheduler. If you're also using the `strength` parameter, it'll be ignored because the number of denoising steps is determined by the discrete timesteps the model is trained on and the declared fractional cutoff.
+
+Let's set `denoising_end=0.8` so the base model performs the first 80% of denoising the **high-noise** timesteps and set `denoising_start=0.8` so the refiner model performs the last 20% of denoising the **low-noise** timesteps. The base model output should be in **latent** space instead of a PIL image.
+
+```py
+prompt = "A majestic lion jumping from a big stone at night"
+
+image = base(
+    prompt=prompt,
+    num_inference_steps=40,
+    denoising_end=0.8,
+    output_type="latent",
+).images
+image = refiner(
+    prompt=prompt,
+    num_inference_steps=40,
+    denoising_start=0.8,
+    image=image,
+).images[0]
+```
+
+[image: generated image of a lion on a rock at night (base model)]
+[image: generated image of a lion on a rock at night in higher quality (ensemble of expert denoisers)]
+
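+For intuition, here is a back-of-the-envelope sketch of how the fractional cutoff maps to inference steps. It assumes the scheduler's default 1000 training timesteps and an evenly spaced schedule; the pipelines compute the exact cutoff internally, so treat the numbers as approximate:
+
+```py
+import numpy as np
+
+num_train_timesteps = 1000   # default for the SDXL schedulers
+num_inference_steps = 40
+frac = 0.8                   # denoising_end for the base, denoising_start for the refiner
+
+# approximate timestep schedule and the cutoff below which the refiner takes over
+timesteps = np.linspace(num_train_timesteps - 1, 0, num_inference_steps)
+cutoff = round(num_train_timesteps - frac * num_train_timesteps)  # 200
+
+base_steps = int((timesteps >= cutoff).sum())    # high-noise steps run by the base model
+refiner_steps = int((timesteps < cutoff).sum())  # low-noise steps run by the refiner
+print(base_steps, refiner_steps)                 # 32 8
+```
+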
+
+The refiner model can also be used for inpainting in the [`StableDiffusionXLInpaintPipeline`]:
+
+```py
+from diffusers import StableDiffusionXLInpaintPipeline
+from diffusers.utils import load_image
+
+base = StableDiffusionXLInpaintPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
+).to("cuda")
+
+refiner = StableDiffusionXLInpaintPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-refiner-1.0",
+    text_encoder_2=base.text_encoder_2,
+    vae=base.vae,
+    torch_dtype=torch.float16,
+    use_safetensors=True,
+    variant="fp16",
+).to("cuda")
+
+img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
+mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
+
+init_image = load_image(img_url).convert("RGB")
+mask_image = load_image(mask_url).convert("RGB")
+
+prompt = "A majestic tiger sitting on a bench"
+num_inference_steps = 75
+high_noise_frac = 0.7
+
+image = base(
+    prompt=prompt,
+    image=init_image,
+    mask_image=mask_image,
+    num_inference_steps=num_inference_steps,
+    denoising_end=high_noise_frac,
+    output_type="latent",
+).images
+image = refiner(
+    prompt=prompt,
+    image=image,
+    mask_image=mask_image,
+    num_inference_steps=num_inference_steps,
+    denoising_start=high_noise_frac,
+).images[0]
+```
+
+This ensemble of expert denoisers method works well for all available schedulers!
+
+### Base to refiner model
+
+SDXL gets a boost in image quality by using the refiner model to add additional high-quality details to the fully-denoised image from the base model, in an image-to-image setting.
+
+Load the base and refiner models:
+
+```py
+from diffusers import DiffusionPipeline
+import torch
+
+base = DiffusionPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
+).to("cuda")
+
+refiner = DiffusionPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-refiner-1.0",
+    text_encoder_2=base.text_encoder_2,
+    vae=base.vae,
+    torch_dtype=torch.float16,
+    use_safetensors=True,
+    variant="fp16",
+).to("cuda")
+```
+
+Generate an image from the base model, and set the model output to **latent** space:
+
+```py
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+
+image = base(prompt=prompt, output_type="latent").images[0]
+```
+
+Pass the generated image to the refiner model:
+
+```py
+image = refiner(prompt=prompt, image=image[None, :]).images[0]
+```
+
+[image: generated image of an astronaut riding a green horse on Mars (base model)]
+[image: higher quality generated image of an astronaut riding a green horse on Mars (base model + refiner model)]
+
+ +For inpainting, load the refiner model in the [`StableDiffusionXLInpaintPipeline`], remove the `denoising_end` and `denoising_start` parameters, and choose a smaller number of inference steps for the refiner. + +## Micro-conditioning + +SDXL training involves several additional conditioning techniques, which are referred to as *micro-conditioning*. These include original image size, target image size, and cropping parameters. The micro-conditionings can be used at inference time to create high-quality, centered images. + + + +You can use both micro-conditioning and negative micro-conditioning parameters thanks to classifier-free guidance. They are available in the [`StableDiffusionXLPipeline`], [`StableDiffusionXLImg2ImgPipeline`], [`StableDiffusionXLInpaintPipeline`], and [`StableDiffusionXLControlNetPipeline`]. + + + +### Size conditioning + +There are two types of size conditioning: + +- [`original_size`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.original_size) conditioning comes from upscaled images in the training batch (because it would be wasteful to discard the smaller images which make up almost 40% of the total training data). This way, SDXL learns that upscaling artifacts are not supposed to be present in high-resolution images. During inference, you can use `original_size` to indicate the original image resolution. Using the default value of `(1024, 1024)` produces higher-quality images that resemble the 1024x1024 images in the dataset. If you choose to use a lower resolution, such as `(256, 256)`, the model still generates 1024x1024 images, but they'll look like the low resolution images (simpler patterns, blurring) in the dataset. + +- [`target_size`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline.__call__.target_size) conditioning comes from finetuning SDXL to support different image aspect ratios. During inference, if you use the default value of `(1024, 1024)`, you'll get an image that resembles the composition of square images in the dataset. We recommend using the same value for `target_size` and `original_size`, but feel free to experiment with other options! + +🤗 Diffusers also lets you specify negative conditions about an image's size to steer generation away from certain image resolutions: + +```py +from diffusers import StableDiffusionXLPipeline +import torch + +pipe = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True +).to("cuda") + +prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" +image = pipe( + prompt=prompt, + negative_original_size=(512, 512), + negative_target_size=(1024, 1024), +).images[0] +``` + +
+[image: images negative conditioned on image resolutions of (128, 128), (256, 256), and (512, 512)]
+
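+You can also pass the positive `original_size` and `target_size` arguments directly to see the effect described above. This is a minimal sketch; the deliberately small `original_size` only exaggerates the effect and isn't a recommended setting:
+
+```py
+from diffusers import StableDiffusionXLPipeline
+import torch
+
+pipe = StableDiffusionXLPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
+).to("cuda")
+
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+# conditioning on a small original size makes the 1024x1024 output resemble an upscaled low-resolution image
+image = pipe(prompt=prompt, original_size=(256, 256), target_size=(1024, 1024)).images[0]
+```
+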
+ +### Crop conditioning + +Images generated by previous Stable Diffusion models may sometimes appear to be cropped. This is because images are actually cropped during training so that all the images in a batch have the same size. By conditioning on crop coordinates, SDXL *learns* that no cropping - coordinates `(0, 0)` - usually correlates with centered subjects and complete faces (this is the default value in 🤗 Diffusers). You can experiment with different coordinates if you want to generate off-centered compositions! + +```py +from diffusers import StableDiffusionXLPipeline +import torch + + +pipeline = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True +).to("cuda") + +prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" +image = pipeline(prompt=prompt, crops_coords_top_left=(256,0)).images[0] +``` + +
+[image: generated image of an astronaut in a jungle, slightly cropped]
+
+
+You can also specify negative cropping coordinates to steer generation away from certain cropping parameters:
+
+```py
+from diffusers import StableDiffusionXLPipeline
+import torch
+
+pipe = StableDiffusionXLPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
+).to("cuda")
+
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+image = pipe(
+    prompt=prompt,
+    negative_original_size=(512, 512),
+    negative_crops_coords_top_left=(0, 0),
+    negative_target_size=(1024, 1024),
+).images[0]
+```
+
+## Use a different prompt for each text-encoder
+
+SDXL uses two text-encoders, so it is possible to pass a different prompt to each text-encoder, which can [improve quality](https://github.com/huggingface/diffusers/issues/4004#issuecomment-1627764201). Pass your original prompt to `prompt` and the second prompt to `prompt_2` (use `negative_prompt` and `negative_prompt_2` if you're using negative prompts):
+
+```py
+from diffusers import StableDiffusionXLPipeline
+import torch
+
+pipeline = StableDiffusionXLPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
+).to("cuda")
+
+# prompt is passed to OpenAI CLIP-ViT/L-14
+prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
+# prompt_2 is passed to OpenCLIP-ViT/bigG-14
+prompt_2 = "Van Gogh painting"
+image = pipeline(prompt=prompt, prompt_2=prompt_2).images[0]
+```
+
+[image: generated image of an astronaut in a jungle in the style of a van gogh painting]
+
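+The negative prompts follow the same routing, with `negative_prompt` going to the first text encoder and `negative_prompt_2` to the second. Here is a short sketch reusing the pipeline above (the example negative prompts are only illustrative):
+
+```py
+# negative_prompt pairs with prompt (first text encoder), negative_prompt_2 with prompt_2 (second text encoder)
+image = pipeline(
+    prompt="Astronaut in a jungle, cold color palette, muted colors, detailed, 8k",
+    prompt_2="Van Gogh painting",
+    negative_prompt="lowres, bad anatomy, worst quality",
+    negative_prompt_2="photorealistic",
+).images[0]
+```
+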
+
+## Optimizations
+
+SDXL is a large model, and you may need to optimize memory to get it to run on your hardware. Here are some tips to save memory and speed up inference.
+
+1. Offload the model to the CPU with [`~StableDiffusionXLPipeline.enable_model_cpu_offload`] for out-of-memory errors:
+
+```diff
+- base.to("cuda")
+- refiner.to("cuda")
++ base.enable_model_cpu_offload()
++ refiner.enable_model_cpu_offload()
+```
+
+2. Use `torch.compile` for ~20% speed-up (you need `torch>=2.0`):
+
+```diff
++ base.unet = torch.compile(base.unet, mode="reduce-overhead", fullgraph=True)
++ refiner.unet = torch.compile(refiner.unet, mode="reduce-overhead", fullgraph=True)
+```
+
+3. Enable [xFormers](/optimization/xformers) to run SDXL if `torch<2.0`:
+
+```diff
++ base.enable_xformers_memory_efficient_attention()
++ refiner.enable_xformers_memory_efficient_attention()
+```
+
+## Other resources
+
+If you're interested in experimenting with a minimal version of the [`UNet2DConditionModel`] used in SDXL, take a look at the [minSDXL](https://github.com/cloneofsimo/minSDXL) implementation which is written in PyTorch and directly compatible with 🤗 Diffusers.
\ No newline at end of file