
Training Experiments and Insights #14

Closed
BJQ123456 opened this issue Aug 15, 2024 · 10 comments

@BJQ123456

I feel like there are a lot of errors in the paper.

@Pbihao
Collaborator

Pbihao commented Aug 15, 2024

Hello, thanks for your feedback!

Yes, the first arXiv paper was a bit rushed and lacks some refinement. Since this project is still under development, it might not fully align with the paper at this stage.

We will continue to refine the paper until the final version is ready.

We would greatly appreciate any help you can offer in correcting it!

@BJQ123456
Author

Thanks for your reply. I have a few questions:

1. The text mentions "plug-and-play," but the base model was trained. Isn't this contradictory?

2. Is there no comparison with methods like ControlNet?

3. Formulas (8) and (9) seem incorrect.

4. Figure 5 does not explain what the three results represent.

5. Figure 8 is unclear.

6. Figure 9 does not explain the conditions of the experiment on the left side.

@Pbihao
Collaborator

Pbihao commented Aug 15, 2024

Hello, thanks for your questions; I think they are all good ones! I will share more details.

  1. One of the most important findings is that directly training the base model yields better performance than methods like LoRA, Adapter, and others. Even when we train the base model, we only select a small subset of the pre-trained parameters, so this does not conflict with the 'plug and play' concept. You can think of it as a specialized version of LoRA, only more direct and straightforward (see the sketch below).
    1.1. We would like to share more experience.
    As mentioned, we only select a small subset of parameters, which is fully adapted to the SD1.5 and SDXL backbones. By training fewer than 100 million parameters, we still achieve excellent performance. However, this is not suitable for SD3 and SVD training. After SDXL, Stability faced significant legal risks due to the generation of highly realistic human images, so they stopped refining their later models (such as SVD and SD3) on human-related data to avoid potential risks.
    To achieve optimal performance, it is necessary to first continue training SVD and SD3 on human-related data to develop a robust backbone before fine-tuning. Of course, you can also combine the continual pretraining and fine-tuning, which is why we directly provide the full SVD parameters.
    Although this may not be directly related to academia, it is crucial for achieving good performance.
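
To make point 1 concrete, here is a minimal PyTorch-style sketch of the idea (illustrative only, not this repo's exact code): freeze the whole backbone, then unfreeze a chosen subset of pre-trained modules and pass only those to the optimizer. The name filter below is an assumption for illustration.

```python
import torch

# Hypothetical illustration: `unet` is a pre-trained diffusion backbone
# (e.g. an SD1.5 / SDXL UNet); the keyword filter is an example, not the
# exact subset selected in this project.
def select_trainable_subset(unet, name_keywords=("attn", "to_out")):
    # Freeze everything first.
    for p in unet.parameters():
        p.requires_grad = False

    # Unfreeze only the chosen subset of pre-trained parameters.
    trainable = []
    for name, p in unet.named_parameters():
        if any(k in name for k in name_keywords):
            p.requires_grad = True
            trainable.append(p)
    return trainable

# Only the selected subset is optimized. Conceptually this is like LoRA,
# but the original weights are updated directly, without low-rank adapters:
# trainable_params = select_trainable_subset(unet)
# optimizer = torch.optim.AdamW(trainable_params, lr=1e-5)
```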

@Pbihao
Collaborator

Pbihao commented Aug 15, 2024

Since the paper has limited space, I would like to share additional experiences. Can I change the title to 'Training Experiments and Insights'? @BJQ123456

SVD-related

  1. Data
    Due to privacy policies, we are unable to share the data. However, data quality is crucial, as many videos on the internet are highly compressed. It's important to focus on collecting high-quality data.
  2. Pose alignment
    Thanks to mimic. SVD performs poorly, especially with large motions, so it is important to avoid large movements and shifts. Please note that during preprocessing there is an alignment step between the reference image and the pose. This is crucial (see the first sketch after this list).
  3. Hands
    Generating hands is a challenging problem in both video and image generation. To address this, we focus on the following strategies:
    a. Use clear and high-quality data, which is crucial for accurate generation.
    b. Since the hands occupy a relatively small area, we apply a larger scale to the loss function for this region to improve the generation quality (see the second sketch after this list).
  4. Magic number
    You will find that we adopt a magic number when adding the conditions. You can adjust this (also covered in the second sketch).

We spent a lot of time finding these tips and are now sharing them all with you. We hope they help!
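
For tip 2, a rough sketch of what reference-to-pose alignment can look like (illustrative only; the repo's preprocessing may differ): fit a scale and translation from a few stable joints, then apply that transform to every pose frame so the driving skeleton matches the reference image's position and size.

```python
import numpy as np

def align_pose_to_reference(pose_kpts, ref_kpts, anchor_ids=(2, 5, 8, 11)):
    """Map driving-pose keypoints onto the reference image's scale/position.

    pose_kpts, ref_kpts: (N, 2) arrays of keypoints in pixel coordinates.
    anchor_ids: indices of stable joints (e.g. shoulders/hips) used to fit
    the transform -- illustrative indices, not a fixed convention.
    """
    src = pose_kpts[list(anchor_ids)]
    dst = ref_kpts[list(anchor_ids)]

    # Fit an isotropic scale + translation (no rotation, for simplicity).
    src_c, dst_c = src.mean(0), dst.mean(0)
    scale = np.linalg.norm(dst - dst_c) / (np.linalg.norm(src - src_c) + 1e-8)
    return (pose_kpts - src_c) * scale + dst_c
```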
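For tips 3b and 4, a minimal sketch of the two tricks (names like `hand_mask`, `hand_weight`, and `control_scale` are placeholders, not this repo's API): upweight the per-pixel diffusion loss inside hand regions, and scale the control features by a tunable constant when adding them to the backbone features.

```python
import torch
import torch.nn.functional as F

def weighted_diffusion_loss(pred_noise, target_noise, hand_mask, hand_weight=5.0):
    # hand_mask: (B, 1, H, W) in {0, 1}; hand_weight is a knob to tune.
    per_pixel = F.mse_loss(pred_noise, target_noise, reduction="none")
    weights = 1.0 + (hand_weight - 1.0) * hand_mask  # larger loss on hands
    return (per_pixel * weights).mean()

# The "magic number": control features are added at reduced strength,
# scaled by a constant that can also be adjusted at inference time.
def add_condition(backbone_feats, control_feats, control_scale=0.2):
    return backbone_feats + control_scale * control_feats
```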

@Pbihao
Collaborator

Pbihao commented Aug 15, 2024

2. We have compared the efficiency and training convergence. More detailed results will be added later.
3. These have been corrected.
4., 5., 6.: Thanks, we will refine these details.

@BJQ123456 changed the title from "error" to "Training Experiments and Insights" Aug 15, 2024
@BJQ123456
Author

Thank you for your response; it was very helpful.

@BJQ123456
Author

There's one point I still don't understand. If it's plug-and-play, does it mean that after training, I can use it directly on other models? But during training, a part of the base model was trained, so if I insert it into a new model, that part hasn't been trained and could lead to a performance drop, right?
Thanks for your reply.

@Pbihao
Collaborator

Pbihao commented Aug 15, 2024

Yes, as demonstrated in our experiments with SD1.5 and SDXL, we trained on a single backbone and then conducted experiments across various backbones. The results show that our method effectively performs control on different backbones.

We also considered this approach and initially attempted to store the weight increments, similar to LoRA but without low-rank compression. However, we eventually found that this step was unnecessary.
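
For reference, the weight-increment idea mentioned above can be sketched like this (illustrative code, not the repo's implementation): store `trained - base` for the parameters that changed, then add those deltas onto any compatible backbone.

```python
import torch

def extract_deltas(base_state, trained_state):
    # Store only the increments of the parameters that were actually trained.
    return {k: trained_state[k] - base_state[k]
            for k in trained_state
            if k in base_state and not torch.equal(trained_state[k], base_state[k])}

def apply_deltas(new_backbone_state, deltas):
    # "Plug" the increments into another backbone with the same architecture;
    # this is like merging LoRA weights, but without low-rank compression.
    for k, d in deltas.items():
        new_backbone_state[k] = new_backbone_state[k] + d
    return new_backbone_state
```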

@BJQ123456
Author

I get it, thanks

@Pbihao Pbihao closed this as completed Aug 19, 2024
@nighting0le01

Hi, could the authors also shed light on combining this with IP-Adapters? Will it cause any issues making it work with IP-Adapters? Does it work well with pretrained IP-Adapters?
