You can train SD 2.1 with 768px images and it should work, though you will probably need to adapt the training parameters for it. One example of this is SPRIGHT, which was trained with 768px images using this dataset. cc: @sayakpaul if you have more insights.
I conducted fine-tuning experiments on two base models, SD-2-1-base (512x512 resolution) and SD-2-1 (768x768 resolution), using identical hyperparameters:
Model architecture:
An image-conditioned model with extended input channels, where the conditioning image is concatenated with the noise along the channel dimension (similar to the approach in Zero123), and the CLIP text encoder is disabled.
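The channel-extension step above can be sketched as follows. This is a minimal, hypothetical illustration (not the author's actual code), assuming a Stable-Diffusion-style UNet whose first convolution takes 4 latent channels: the conv is widened to 8 channels so a conditioning-image latent can be concatenated with the noisy latent, with the new weights zero-initialized so the pretrained behavior is preserved at the start of fine-tuning. The same surgery applies to `conv_in` of diffusers' `UNet2DConditionModel`.

```python
import torch
import torch.nn as nn

def widen_conv_in(old_conv: nn.Conv2d, extra_channels: int = 4) -> nn.Conv2d:
    """Return a copy of old_conv that accepts extra input channels.

    The pretrained weights are copied into the first in_channels slots;
    the new channels are zero-initialized, so the widened conv initially
    ignores the conditioning input.
    """
    new_conv = nn.Conv2d(
        old_conv.in_channels + extra_channels,
        old_conv.out_channels,
        kernel_size=old_conv.kernel_size,
        padding=old_conv.padding,
    )
    with torch.no_grad():
        new_conv.weight.zero_()  # conditioning channels start at zero
        new_conv.weight[:, : old_conv.in_channels] = old_conv.weight
        new_conv.bias.copy_(old_conv.bias)
    return new_conv

# Usage sketch: concatenate the conditioning-image latent with the noisy latent.
# Channel/feature sizes here are illustrative (4 latent channels, 320 features).
old = nn.Conv2d(4, 320, kernel_size=3, padding=1)
new = widen_conv_in(old)
noisy = torch.randn(1, 4, 64, 64)   # noisy latent
cond = torch.randn(1, 4, 64, 64)    # conditioning-image latent
out = new(torch.cat([noisy, cond], dim=1))
```

Because the new channels are zeroed, the widened conv initially produces the same output as the pretrained one regardless of the conditioning input, which tends to make fine-tuning more stable than random initialization.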
Dataset:
15k frames for a single person.
Results:
SD-2-1-base (512x512 resolution):
(Above: Ground Truth (GT), Below: Model Generation)
SD-2-1 (768x768 resolution):
(Above: Ground Truth (GT), Below: Model Generation)
It looks like the SD-2-1 (768x768 resolution) model suffered from model collapse.