Questions regarding: DreamBooth, Low-level Noise Augmentation, & fine-tuning the super resolution components #901
Replies: 4 comments 5 replies
-
The model used in that paper is different from SD, although they are similar in principle. You can see this in its training capabilities, and in the fact that the prior-preservation loss does not behave quite the same way; what we have now is someone's adaptation of the idea for SD training. If you simply train the model at a resolution higher than its base resolution, you will be able to generate images at that resolution. For example, training at 1024 on a 512 base lets you generate normal 1024 images without a high-res fix. But trying to generate the trained subject at 512 will then produce artifacts. As far as I know, there is currently no way to change the resolution during training itself. Even if that were implemented, there are questions about how to do it correctly, since the resolution determines the size of the matrices, and a mismatch would most likely cause artifacts during generation again.
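As a rough illustration of why the matrix sizes depend on resolution: in SD, the VAE downsamples each spatial dimension by a factor of 8, so the UNet's attention layers operate on a latent grid whose size is tied directly to the training resolution. A minimal sketch (the helper name is made up):

```python
# Sketch: how training resolution changes the latent grid a Stable
# Diffusion style UNet sees (hypothetical helper; VAE factor is 8 in SD).

VAE_FACTOR = 8  # SD's VAE compresses each spatial dimension by 8

def latent_shape(height, width, factor=VAE_FACTOR):
    """Spatial size of the latent grid the UNet actually operates on."""
    return height // factor, width // factor

# 512x512 training -> 64x64 latents; 1024x1024 -> 128x128 latents.
# Self-attention cost grows with the square of the token count (h*w),
# which is one reason mixing resolutions mid-training is not trivial.
print(latent_shape(512, 512))    # (64, 64)
print(latent_shape(1024, 1024))  # (128, 128)
```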
-
I'm not sure what the actual impact on training would be, but one could theoretically train the model at a max of 512, 768, 1024, etc. with the same dataset, since the images will be automatically downscaled to meet the maximum resolution specified in the UI. By this process, one could probably achieve the same effect as the upscaling step.
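The downscaling step described above can be sketched as a small helper that fits an image's longer side to the chosen maximum while preserving aspect ratio (an assumption about how the UI resizes; `fit_to_max` is a hypothetical name):

```python
# Sketch (assumption): the UI downscales so the longer side fits the chosen
# maximum resolution, preserving aspect ratio. This helper only computes
# the target size; actual resampling would be done by an image library.

def fit_to_max(width, height, max_side):
    """Return (w, h) scaled so max(w, h) <= max_side, aspect preserved."""
    scale = min(1.0, max_side / max(width, height))
    return round(width * scale), round(height * scale)

print(fit_to_max(2048, 1536, 768))  # (768, 576)
print(fit_to_max(640, 480, 768))    # (640, 480) -- already small enough
```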
…On Sun, Feb 5, 2023 at 11:53 AM cerega66 wrote:
May I know what you are training on, and with what parameters, that you manage to set batch size 1 at 768?
-
After reading a blog article (https://www.assemblyai.com/blog/how-imagen-actually-works/), I think I'm starting to understand what actually happens during training of the super-resolution component. After the low-resolution text-to-image model is trained, a low-resolution sample image is generated. Then, instead of conditioning only on the caption encoding, the SR model also conditions on that low-resolution sample. The process can then be repeated, with the self-attention layers removed and explicit cross-attention layers added. I'm assuming they did this because the relational information from the caption has already been encoded in the lower-resolution image, so the layout does not need to change, and the focus is now on the actual objects in the picture. I guess I'll read the full Imagen paper, research how Stability AI's upscale model works, and then see if I can find any info on training an upscaling model. Maybe down that path I'll find something that works. I'm guessing at some point I'll need to learn to program in Python, which is something I've been meaning to do anyway.
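If I'm reading the blog post correctly, the extra conditioning is typically done by upsampling the low-resolution sample and concatenating it channel-wise with the noisy high-resolution input before it enters the SR UNet. A toy, shapes-only sketch (no real model, just NumPy arrays; all names are illustrative):

```python
import numpy as np

# Toy sketch of Imagen-style SR conditioning: the upsampled low-res sample
# is concatenated channel-wise with the noisy high-res training image, so
# the SR UNet's first conv sees 6 input channels instead of 3.

low_res = np.zeros((3, 64, 64))     # generated 64x64 sample (C, H, W)
noisy_hi = np.zeros((3, 256, 256))  # noised 256x256 training target

# Naive nearest-neighbour upsample of the conditioning image (factor 4)
up = low_res.repeat(4, axis=1).repeat(4, axis=2)

unet_input = np.concatenate([noisy_hi, up], axis=0)
print(unet_input.shape)  # (6, 256, 256)
```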
-
@minienglish1 when we use higher-resolution images, don't they also get downscaled to the model's base resolution during training? How does training on higher-resolution images work? @d8ahazard
-
In the DreamBooth paper, they discuss training the model in two parts (quoted below).
Can we do something similar to the "fine-tuning the super resolution components" step with this extension? Particularly for fine-tuning a model, not necessarily using a unique token. I can't find a setting that adjusts noise augmentation.
Some of the models I'm training produce good images that represent the training data, but everything seems too "smooth"; it looks too much like a computer-rendered person. I'd like to get finer details in the images.
If there isn't a way to adjust noise, is there another method to increase fine detail?
A lower learning rate? (I mostly use 1e-6, batch size 1, resolution 768; testing even higher resolutions now)
Training on higher-resolution photos than will be generated?
First training at a low resolution, then at a higher resolution?
Close-up shots cropped from the originals and trained alongside them (if so, how should the close-up's caption be written with respect to the original's)?
Thanks in advance.
Quotes from the paper: DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation, to provide background for the questions:
Figure 4: Fine-tuning.
Given ~3-5 images of a subject, we fine-tune a text-to-image diffusion model in two steps:
(a) fine-tuning the low-resolution text-to-image model with the input images paired with a text prompt containing a unique identifier and the name of the class the subject belongs to (e.g., “A [V] dog”); in parallel, we apply a class-specific prior-preservation loss, which leverages the semantic prior that the model has on the class and encourages it to generate diverse instances belonging to the subject’s class using the class name in a text prompt (e.g., “A dog”).
(b) fine-tuning the super resolution components with pairs of low-resolution and high-resolution images taken from our input images set, which enables us to maintain high-fidelity to small details of the subject.
4.3 Personalized Instance-Specific Super-Resolution
While the text-to-image diffusion model controls for most visual semantics, the super-resolution (SR) models are essential to achieve photorealistic content and to preserve subject instance details. We find that if SR networks are used without fine-tuning, the generated output can contain artifacts, since the SR models might not be familiar with certain details or textures of the subject instance, or the subject instance might have hallucinated incorrect features or missing details. Figure 14 (bottom row) shows some sample output images with no fine-tuning of SR models, where the model hallucinates some high-frequency details. We find that fine-tuning the 64×64 → 256×256 SR model is essential for most subjects, and fine-tuning the 256×256 → 1024×1024 model can benefit some subject instances with high levels of fine-grained detail.
Low-level Noise Augmentation
We find results to be suboptimal if the training recipes and test parameters of Saharia et al. [56] are used to fine-tune the SR models with the given few shots of a subject instance. Specifically, we find that maintaining the original level of noise augmentation used to train the SR networks leads to blurring of high-frequency patterns of the subject and of the environment. See Figure 14 (middle row) for sample generations. In order to faithfully reproduce the subject instance, we reduce the level of noise augmentation from 10^-3 to 10^-5 during fine-tuning of the 256×256 → 1024×1024 SR model. With this small modification, we are able to recover fine-grained details of the subject instance.
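The noise-augmentation change quoted above can be sketched very roughly as follows. This is a simplified reading, applying a single scalar noise level to the low-resolution conditioning image, whereas the actual recipe conditions on a noise-level schedule; all names here are made up:

```python
import numpy as np

# Simplified sketch of noise augmentation on the SR conditioning image.
# DreamBooth lowers the augmentation level from 1e-3 to 1e-5 when
# fine-tuning the 256->1024 SR model, so the conditioning image stays
# sharper and fine subject details are preserved.

rng = np.random.default_rng(0)

def augment(cond_img, aug_level):
    """Add Gaussian noise at the given level to the conditioning image."""
    return cond_img + aug_level * rng.standard_normal(cond_img.shape)

img = np.ones((3, 256, 256))
strong = augment(img, 1e-3)  # original SR training recipe
weak = augment(img, 1e-5)    # DreamBooth fine-tuning recipe

# The weaker augmentation perturbs the conditioning image far less.
print(np.abs(strong - img).mean() > np.abs(weak - img).mean())  # True
```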