Questions regarding: DreamBooth, Low-level Noise Augmentation, & fine-tuning the super resolution components #901
Replies: 4 comments 5 replies
-
The model used in that paper is different from SD, although they are similar in principle. You can see this in its training capabilities, and in the fact that the prior-preservation loss does not behave quite the same way; what we have now is someone's adaptation of the idea for SD training. If you simply train the model at a resolution higher than its base resolution, you will be able to generate images at that resolution. For example, training at 1024 on a 512 base lets you generate normal 1024 images without a high-res fix. But trying to generate the trained subject at 512 will then produce artifacts. As far as I know, there is currently no way to change the resolution during training itself. Even if that were implemented, there are questions about how to do it correctly, since the resolution determines the size of the matrices, and a mismatch would most likely cause artifacts during generation again.
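As a rough illustration of why the matrix sizes depend on resolution: in SD, the VAE downsamples each spatial dimension by a factor of 8, so the UNet's attention layers operate on a latent grid whose size is tied directly to the training resolution. A minimal sketch (the helper name is made up):

```python
# Sketch: how training resolution changes the latent grid a Stable
# Diffusion style UNet sees (hypothetical helper; VAE factor is 8 in SD).

VAE_FACTOR = 8  # SD's VAE compresses each spatial dimension by 8

def latent_shape(height, width, factor=VAE_FACTOR):
    """Spatial size of the latent grid the UNet actually operates on."""
    return height // factor, width // factor

# 512x512 training -> 64x64 latents; 1024x1024 -> 128x128 latents.
# Self-attention cost grows with the square of the token count (h*w),
# which is one reason mixing resolutions mid-training is not trivial.
print(latent_shape(512, 512))    # (64, 64)
print(latent_shape(1024, 1024))  # (128, 128)
```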
-
I'm not sure what the actual impact on training would be, but one could theoretically train the model at a max of 512, 768, 1024, etc. with the same dataset, since the images will be automatically downscaled to meet the maximum resolution specified in the UI. By this process, one could probably achieve the same effect as the upscaling step.
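The downscaling step described above can be sketched as a small helper that fits an image's longer side to the chosen maximum while preserving aspect ratio (an assumption about how the UI resizes; `fit_to_max` is a hypothetical name):

```python
# Sketch (assumption): the UI downscales so the longer side fits the chosen
# maximum resolution, preserving aspect ratio. This helper only computes
# the target size; actual resampling would be done by an image library.

def fit_to_max(width, height, max_side):
    """Return (w, h) scaled so max(w, h) <= max_side, aspect preserved."""
    scale = min(1.0, max_side / max(width, height))
    return round(width * scale), round(height * scale)

print(fit_to_max(2048, 1536, 768))  # (768, 576)
print(fit_to_max(640, 480, 768))    # (640, 480) -- already small enough
```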
…On Sun, Feb 5, 2023 at 11:53 AM cerega66 wrote:
May I know what you are training on, and with what parameters, that you manage to set batch size 1 at 768?
-
After reading a blog article (https://www.assemblyai.com/blog/how-imagen-actually-works/), I think I'm starting to understand what actually happens during training of the super-resolution component. After the low-resolution text-to-image model is trained, a low-resolution sample image is generated. Then, instead of conditioning only on the caption encoding, the SR model also conditions on that low-resolution sample. The process can then be repeated, with the self-attention layers removed and explicit cross-attention layers added. I'm assuming they did this because the relational information from the caption has already been encoded in the lower-resolution image, so the layout does not need to change, and the focus is now on the actual objects in the picture. I guess I'll read the full Imagen paper, research how Stability AI's upscale model works, and then see if I can find any info on training an upscaling model. Maybe down that path I'll find something that works. I'm guessing at some point I'll need to learn to program in Python, which is something I've been meaning to do anyway.
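If I'm reading the blog post correctly, the extra conditioning is typically done by upsampling the low-resolution sample and concatenating it channel-wise with the noisy high-resolution input before it enters the SR UNet. A toy, shapes-only sketch (no real model, just NumPy arrays; all names are illustrative):

```python
import numpy as np

# Toy sketch of Imagen-style SR conditioning: the upsampled low-res sample
# is concatenated channel-wise with the noisy high-res training image, so
# the SR UNet's first conv sees 6 input channels instead of 3.

low_res = np.zeros((3, 64, 64))     # generated 64x64 sample (C, H, W)
noisy_hi = np.zeros((3, 256, 256))  # noised 256x256 training target

# Naive nearest-neighbour upsample of the conditioning image (factor 4)
up = low_res.repeat(4, axis=1).repeat(4, axis=2)

unet_input = np.concatenate([noisy_hi, up], axis=0)
print(unet_input.shape)  # (6, 256, 256)
```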
-
@minienglish1 when we use higher-resolution images, don't they also get downscaled to the model's base resolution during training? How does training on higher-resolution images work? @d8ahazard
-
In the DreamBooth paper, they discuss training the model in two parts (quoted below).
Can we do something similar to the "fine-tuning the super resolution components" step with this extension? Particularly for fine-tuning a model, not necessarily using a unique token. I can't find a setting that adjusts noise augmentation.
Some of the models I'm training produce good images that represent the training data, but everything seems too "smooth"; it looks too much like a computer-rendered person. I'd like to get finer details in the images.
If there isn't a way to adjust noise, is there another method to increase fine detail?
A lower learning rate? (I mostly use 1e-6, batch size 1, resolution 768; testing even higher resolutions now)
Training on higher-resolution photos than will be generated?
First training at a low resolution, then at a higher resolution?
Close-up shots cropped from the originals and trained alongside them (if so, how should the close-up's caption be written with respect to the original's)?
Thanks in advance.
Quotes from the paper: DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation, to provide background for the questions:
Figure 4: Fine-tuning.
Given ~3-5 images of a subject, we fine-tune a text-to-image diffusion model in two steps:
(a) fine-tuning the low-resolution text-to-image model with the input images paired with a text prompt containing a unique identifier and the name of the class the subject belongs to (e.g., “A [V] dog”); in parallel, we apply a class-specific prior-preservation loss, which leverages the semantic prior that the model has on the class and encourages it to generate diverse instances belonging to the subject’s class using the class name in a text prompt (e.g., “A dog”).
(b) fine-tuning the super resolution components with pairs of low-resolution and high-resolution images taken from our input images set, which enables us to maintain high-fidelity to small details of the subject.
4.3 Personalized Instance-Specific Super-Resolution
While the text-to-image diffusion model controls for most visual semantics, the super-resolution (SR) models are essential to achieve photorealistic content and to preserve subject instance details. We find that if SR networks are used without fine-tuning, the generated output can contain artifacts, since the SR models might not be familiar with certain details or textures of the subject instance, or the subject instance might have hallucinated incorrect features or missing details. Figure 14 (bottom row) shows some sample output images with no fine-tuning of SR models, where the model hallucinates some high-frequency details. We find that fine-tuning the 64×64 → 256×256 SR model is essential for most subjects, and fine-tuning the 256×256 → 1024×1024 model can benefit some subject instances with high levels of fine-grained detail.
Low-level Noise Augmentation
We find results to be suboptimal if the training recipes and test parameters of Saharia et al. [56] are used to fine-tune the SR models with the given few shots of a subject instance. Specifically, we find that maintaining the original level of noise augmentation used to train the SR networks leads to blurring of high-frequency patterns of the subject and of the environment. See Figure 14 (middle row) for sample generations. In order to faithfully reproduce the subject instance, we reduce the level of noise augmentation from 10^-3 to 10^-5 during fine-tuning of the 256×256 → 1024×1024 SR model. With this small modification, we are able to recover fine-grained details of the subject instance.
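The noise-augmentation change quoted above can be sketched very roughly as follows. This is a simplified reading, applying a single scalar noise level to the low-resolution conditioning image, whereas the actual recipe conditions on a noise-level schedule; all names here are made up:

```python
import numpy as np

# Simplified sketch of noise augmentation on the SR conditioning image.
# DreamBooth lowers the augmentation level from 1e-3 to 1e-5 when
# fine-tuning the 256->1024 SR model, so the conditioning image stays
# sharper and fine subject details are preserved.

rng = np.random.default_rng(0)

def augment(cond_img, aug_level):
    """Add Gaussian noise at the given level to the conditioning image."""
    return cond_img + aug_level * rng.standard_normal(cond_img.shape)

img = np.ones((3, 256, 256))
strong = augment(img, 1e-3)  # original SR training recipe
weak = augment(img, 1e-5)    # DreamBooth fine-tuning recipe

# The weaker augmentation perturbs the conditioning image far less.
print(np.abs(strong - img).mean() > np.abs(weak - img).mean())  # True
```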