Decent text-to-image generation results on CUB200 #131

kobiso · 2021-03-29T02:44:00Z

kobiso
Mar 29, 2021

DALLE on CUB200

I trained DALLE + VQGAN (Taming Transformers) on the CUB200 dataset and got decent generation results.
They are not as good as OpenAI's DALLE, but I think it is good progress so far.
Hope you folks can get some intuition from this setting and improve the quality further :)
Pretrained model is available: https://github.com/kobiso/DALLE-reproduction

Main results

Text to image generation and re-ranking by CLIP

Check for more results: Decent text-to-image generation results on CUB200 #131 (comment)

Generate rest of image based on the given cropped image

Check for more results: Decent text-to-image generation results on CUB200 #131 (comment)

Model spec

VAE

Pretrained VQGAN

DALLE

dim = 256
text_seq_len = 80
depth = 8
heads = 8
dim_head = 64
reversible = 0
attn_types = (full,sparse)
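
For readers wondering how a tuple of attention types maps onto depth = 8 layers, here is a minimal sketch. It assumes the types are cycled in order until every layer has one, which is my understanding of how DALLE-pytorch treats attn_types, but it is not copied from the library. The helper name assign_attn_types is hypothetical.

```python
from itertools import cycle, islice

def assign_attn_types(attn_types, depth):
    """Return one attention type per layer by cycling through attn_types."""
    return list(islice(cycle(attn_types), depth))

# Spec from the post: depth = 8, attn_types = (full, sparse)
layers = assign_attn_types(("full", "sparse"), depth=8)
for index, attn_type in enumerate(layers):
    print(f"layer {index}: {attn_type}")

# Width of the attention projection implied by heads = 8, dim_head = 64
inner_dim = 8 * 64
print(f"attention inner dim: {inner_dim}")
```

Under that assumption, the even layers use full attention and the odd layers use sparse attention.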

Optimization

Optimizer: Adam
Learning rate: start with 0.00045 and apply ReduceLROnPlateau (PR: Add adamw and lr decay #138)
- The learning rate has to be scaled by the number of GPUs: 0.00045 * gpus
Apply loss weighting (PR: Add loss weighting by following DALLE paper #134)
No gradient clipping, for better loss convergence
- However, for a larger dataset, gradient clipping is necessary to avoid NaN.
Batch size: 110 * 8 (gpus)
- I used Horovod for multi-GPU training
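
The learning-rate scaling and plateau decay described above can be sketched in plain Python. This is only an illustration: PlateauDecay is a simplified, hypothetical stand-in for torch.optim.lr_scheduler.ReduceLROnPlateau, and the factor and patience values are assumptions, since the post does not state them.

```python
import math

BASE_LR = 0.00045      # starting learning rate from the post
PER_GPU_BATCH = 110    # per-GPU batch size from the post
NUM_GPUS = 8           # number of GPUs from the post

def scaled_lr(base_lr, num_gpus):
    """Linear scaling: multiply the base learning rate by the GPU count."""
    return base_lr * num_gpus

class PlateauDecay:
    """Simplified plateau scheduler (assumed factor/patience, no threshold or cooldown)."""

    def __init__(self, lr, factor=0.5, patience=2):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = math.inf
        self.bad_steps = 0

    def step(self, loss):
        if loss < self.best:
            self.best = loss
            self.bad_steps = 0
        else:
            self.bad_steps += 1
            if self.bad_steps > self.patience:
                self.lr *= self.factor   # decay once the loss has stalled
                self.bad_steps = 0
        return self.lr

lr = scaled_lr(BASE_LR, NUM_GPUS)
effective_batch = PER_GPU_BATCH * NUM_GPUS
print(f"lr={lr:.5f}, effective batch size={effective_batch}")

scheduler = PlateauDecay(lr)
for loss in [3.0, 2.5, 2.5, 2.5, 2.5]:   # made-up loss values
    print(f"loss={loss:.2f} -> lr={scheduler.step(loss):.5f}")
```

In real training code the scheduler from PyTorch would take the optimizer directly; the class above only shows the decision rule.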

Results

Generation during training

First row: GT / Second row: generation / Third row: caption

Generation by input text

Training graph

Opinion

I guess VQGAN is better than OpenAI's VAE because it needs less memory and less computation and gives better generation quality.
As @afiaka87 mentioned in Finding Good Learning Rate For Different Values of Depth, Heads #84, and as many transformer papers mention, the learning rate is critical for minimizing loss and for generation quality.
- Adam with LR 0.001 showed bad generation quality.
Even with the same text, some generations are decent and some are not.

TheodoreGalanos · 2021-03-29T02:49:06Z

TheodoreGalanos
Mar 29, 2021

Nice, thanks for sharing; these might be the best we've seen so far! I'm curious about the text length: did you simply set it to the maximum in your dataset, or to a maximum you cared about?

9 replies

kobiso Mar 29, 2021
Author

What kind of GPU and how much VRAM do you use?

afiaka87 Mar 29, 2021

You're saying the taming model performs better as well? Or do you just mean with regard to how much less memory it uses?

As far as I know, they achieved something like 90% accuracy compared to OpenAI's VAE. An impressive result and very useful, but they do mention that it has some trouble reconstructing certain details.

I can maybe see this being true on images, however, if their training set skews heavily towards real photos.

Let me know if that's what you meant and if you have any concrete examples.

kobiso Mar 30, 2021
Author

This is the reconstruction image from VQGAN github.

I feel like DALLE is good at shape but blurry, while VQGAN is good at details but not great at shape.

I have tried both OpenAI's VAE and VQGAN to train DALLE, and I found that VQGAN tends to generate images with better semantics and details. This is also related to memory usage and computational cost, because I couldn't finish training DALLE with OpenAI's VAE as it takes too long... Maybe if I could finish training DALLE with OpenAI's VAE, the quality would be better (not sure). However, I will stick with VQGAN because I don't have computational power like OpenAI's.

zhangyingbit Oct 11, 2022

@kobiso Thanks for sharing~ Can you supply the config of the VQ-GAN used in your training, such as embed_dim and n_embed?

zhangyingbit Oct 11, 2022

@kobiso I also have another question: why are the generated images of DALLE-dVAE so smooth compared with VQ-GAN? Can you provide some ideas?

afiaka87 · 2021-03-29T21:16:00Z

afiaka87
Mar 29, 2021

Incredible result! Definitely the first useful example posted that shows a clear ability to generalize outside the training set. New bar set! Thanks for sharing the loss graph! Very useful.

Did you find any learning rates which stood out as the clear winner? It was my experience that - for whatever reason - dalle-pytorch seemed to just kind of occasionally work well in the 1e-4 to 5e-4 range. Sometimes it was a complete failure and sometimes it worked fine - but generally 3e-4 was the best possible learning rate.

Now I'm revisiting the issue and I'm curious if I got my math wrong or something...

Also, you mention that it makes mistakes. Have you tried taking the top 32 of 512 generations reranked via CLIP as in the OpenAI paper? I'm curious if that gets rid of the deformities.

4 replies

kobiso Mar 30, 2021
Author

Thanks! I agree with you regarding the learning rate. I only tried LRs of 1e-3 and 4.5e-4 (DALLE paper's LR), and 1e-3 tends to fail training. But 4.5e-4 seems pretty stable. I shall try 3e-4 as well :)
I haven't tried CLIP reranking yet. I will share the results when I do.

afiaka87 Mar 30, 2021

Wow - so those results are just a single generation each - no re-ranking? Definitely update the post if you do a re-rank generation!

afiaka87 Mar 31, 2021

Aaand yep. Mind blown. Can you edit your top level comment to include the link to your new results?

@kobiso

kobiso Mar 31, 2021
Author

I did, thanks :)

JeniaJitsev · 2021-03-29T22:39:14Z

JeniaJitsev
Mar 29, 2021

Great step, thanks for sharing! We are currently thinking of replicating the training on multi-node, multi-GPU, using DeepSpeed to have a future-proof way to split a very large network across GPUs if a simple data-parallel scheme is no longer sufficient for each model clone to fit on a single GPU. However, using Horovod is also a good baseline to test data-parallel mode with networks that are still small enough. Would you be interested in giving your code a try on a number of V100s so that we (including @afiaka87 @lucidrains) could have a look at that together? On our machine (https://apps.fz-juelich.de/jsc/hps/juwels/booster-overview.html, at Juelich Supercomputing Center, Germany) we have Horovod already running, so that should be quite easy to reproduce.

7 replies

kobiso Mar 30, 2021
Author

@JeniaJitsev Doing it on multi-node sounds good, because I have only tried multi-GPU. However, I cannot share the current training code because it is work related. But I can definitely help you out regarding Horovod, or I can try to run lucidrains' training code on Horovod.

@afiaka87 The training code I used is different from lucidrains' training code, so the way checkpoints are saved is different. I will see if it can be loaded with lucidrains' training code and share :)

JeniaJitsev Mar 30, 2021

@kobiso you can also get in contact with @janEbert, who is now working on getting the training set up on multi-node, multi-GPU. We are interested in getting it running with DeepSpeed; a working Horovod baseline would be very helpful here as well.

janEbert Mar 30, 2021

Hey @kobiso, if it's not too much trouble to install, could you try to use my DeepSpeed branch to reproduce your results?
Just hit me up when something is not clear; though it's also my first time using DeepSpeed. The BATCH_SIZE is the effective batch size, so you may want to scale it up.

kobiso Mar 31, 2021
Author

Will do! I might be able to start next week. I will share the results when I'm done :)

janEbert Mar 31, 2021

Thank you so much, would be very helpful to see whether it scales correctly!

TheodoreGalanos · 2021-03-30T06:15:35Z

TheodoreGalanos
Mar 30, 2021

@kobiso a quick question, which CUB200 dataset did you use? Was it the 2011 version with 11,788 images?

Thanks in advance!

3 replies

kobiso Mar 30, 2021
Author

Yeap! I used 2011 version CUB200: http://www.vision.caltech.edu/visipedia/CUB-200-2011.html

TheodoreGalanos Mar 30, 2021

Thanks! Can't tell you how happy I am to hear that a 12k image model seems to work lol. Going back to try my 3k image dataset.

wintersurvival Jul 22, 2021

Do we need to split the CUB200 dataset into a training set and a test set before training?

rom1504 · 2021-03-30T07:27:37Z

rom1504
Mar 30, 2021

That's pretty cool!
What do you think about sharing the pretrained weights so people can experiment with the model a bit?

3 replies

abhi1nandy2 Mar 30, 2021

Also, could you share the test script, i.e., the one that takes text input(s) and generates images?

kobiso Mar 30, 2021
Author

I will share the pretrained weights soon; I just need some time to convert my model to the proper format for this repository.

yovizzle Mar 31, 2021

I'd really appreciate this too! :D

kobiso · 2021-03-30T15:26:43Z

kobiso
Mar 30, 2021
Author

Pretrained CLIP reranking

For each text, 32 images are generated and then re-ranked with the pretrained CLIP.
The value on each image indicates CLIP's probability.
Results are pretty good 🛩️
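
The re-ranking step above can be sketched as follows. This is a hedged illustration, not the code used for these results: it assumes the "probability" is a softmax over the CLIP logits of the 32 candidates for one text, and the candidate names and logit values are made-up placeholders (a real run would get the logits from a CLIP model).

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rerank(candidates, clip_logits, top_n):
    """Sort candidates by the softmax probability of their CLIP logit, best first."""
    probs = softmax(clip_logits)
    ranked = sorted(zip(candidates, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]

# Hypothetical stand-ins: names instead of image tensors, synthetic logits.
candidates = [f"sample_{i:02d}" for i in range(32)]
clip_logits = [((i * 7) % 32) / 4.0 for i in range(32)]

for name, prob in rerank(candidates, clip_logits, top_n=4):
    print(f"{name}: {prob:.3f}")
```

Because the softmax runs over the candidates of a single caption, the values are only comparable within one batch of generations, not across different texts.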

Results

text: this colorful bird has a yellow breast , with a black crown and a black cheek patch

Click here for more results 🖱️

text: this green bird has a red crown and long pointed bill
text: the bird has a black crown to tail with a yellow belly and brown throat

7 replies

kobiso Mar 31, 2021
Author

Oh, that would be great! I am working on DALLE + VQGAN with a larger dataset (MSCOCO) now, and it seems easier than training with OpenAI's VAE. Let them know VQGAN is awesome 🚀

lucidrains Mar 31, 2021
Maintainer

I'll tell them 🇩🇪 you said hello from 🇰🇷 :)

JeniaJitsev Mar 31, 2021

That is very valuable information, also for our effort to drive training in a multi-node setting (@janEbert @mehdidc, #137). It seems from these preliminary results that taking the VQ-GAN from the Taming Transformers work as the visual encoder makes training go way better than taking the original dVAE from the OpenAI work. @lucidrains just to clarify - were you also pointing to VQ-GAN, or did you indeed mean VQ-VAE for a cheap training loophole?

afiaka87 Mar 31, 2021

Let them know VQGAN is awesome

@lucidrains @kobiso

One of the first things I did! Throw them some more love if you can!

So thanks and great work everyone. You're awesome.
CompVis/taming-transformers#32

pesser Mar 31, 2021

just received the message :) glad to hear the VQGAN helps. This already looks really promising! 🤩 I'm excited to see where this will lead to 🚀

kobiso · 2021-03-30T15:34:56Z

kobiso
Mar 30, 2021
Author

Generate rest of image based on the given cropped image

Images are generated given 100 ground-truth (GT) image tokens (out of 256 image tokens in total).
It is something like OpenAI's example below:

Results

Click here for more results 🖱️
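
The completion setup described in this comment (100 of 256 ground-truth image tokens given, the rest generated) can be sketched like this. It is a toy illustration under stated assumptions: the 256 tokens are taken to be a 16 x 16 grid in raster order, the codebook size is a placeholder, and dummy_sampler is a hypothetical stand-in for the transformer, which would really condition on the text and the prefix.

```python
import random

TOTAL_IMAGE_TOKENS = 256   # from the post
PRIMED_TOKENS = 100        # ground-truth tokens given, from the post
GRID_SIDE = 16             # assumption: 256 tokens form a 16 x 16 grid
CODEBOOK_SIZE = 1024       # assumption: placeholder codebook size

def complete_image(gt_tokens, num_primed, sample_next):
    """Keep the first num_primed ground-truth tokens, then sample the rest one by one."""
    tokens = list(gt_tokens[:num_primed])
    while len(tokens) < len(gt_tokens):
        tokens.append(sample_next(tokens))
    return tokens

rng = random.Random(0)
gt_tokens = [rng.randrange(CODEBOOK_SIZE) for _ in range(TOTAL_IMAGE_TOKENS)]

def dummy_sampler(prefix):
    """Stand-in for the model: returns a random token id and ignores the prefix."""
    return rng.randrange(CODEBOOK_SIZE)

completed = complete_image(gt_tokens, PRIMED_TOKENS, dummy_sampler)
full_rows, extra = divmod(PRIMED_TOKENS, GRID_SIDE)
print(f"primed: {full_rows} full rows + {extra} tokens")
print(f"sampled: {len(completed) - PRIMED_TOKENS} tokens")
```

Under the raster-order assumption, the given crop covers a little more than the top third of the image.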

1 reply

lucidrains Mar 30, 2021
Maintainer

Amazing! It is working :)

kobiso · 2021-04-02T14:27:16Z

kobiso
Apr 2, 2021
Author

CUB200 trained model sharing

As you folks asked, I'm sharing the CUB200 trained models (sorry for the late sharing!)
Since I used different training code, a different BPE tokenizer, and a different checkpoint format, I could not use the generation code from this repository.
So, I created a temporary repository below, which includes text-to-image generation, pretrained CLIP reranking, and generation from a cropped image.
There are two models: 1) a model trained with the Adam optimizer (details), 2) a model trained with the AdamW optimizer (details).

https://github.com/kobiso/DALLE-reproduction

Hope you have fun with the models 🛩️

One more tip related to memory and performance

Training DALLE requires lots of VRAM and we do not have as much as OpenAI 😞
So, I tried my best to minimize everything, including the text vocabulary.
DALLE used a text vocabulary of 16,384 and CLIP used a text vocabulary of 49,152, while this repo uses CLIP's tokenizer.
However, I think these text vocabulary sizes are unnecessarily large for small experiments.
Reducing the vocabulary size saves a lot of memory and also reduces the learning difficulty (the cross-entropy loss is easier with a smaller vocabulary).
That is why I trained a custom BPE tokenizer with a vocabulary size of around 8000.
Hope this is helpful 😄
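
To give a rough feel for the saving, here is a back-of-the-envelope sketch. It assumes each text token costs one embedding row plus one output-projection row of width dim = 256, and it ignores biases, positional embeddings, and any extra padding tokens the library adds, so treat the numbers as an estimate only. The vocabulary sizes are the figures quoted in the post.

```python
DIM = 256  # model width from the original post

def text_vocab_params(vocab_size, dim=DIM):
    """Rough count: one embedding row plus one output-projection row per token."""
    embedding = vocab_size * dim
    output_projection = vocab_size * dim
    return embedding + output_projection

for vocab_size in (8_000, 16_384, 49_152):
    params = text_vocab_params(vocab_size)
    print(f"vocab {vocab_size:>6}: ~{params / 1e6:.1f}M vocabulary-tied parameters")
```

The logits computed during training also grow with the vocabulary (batch size x sequence length x vocabulary size), which is likely where much of the activation memory goes at large batch sizes.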

25 replies

kobiso Apr 16, 2021
Author

@kswamy15
I believe the randomness occurs from here:

DALLE-pytorch/dalle_pytorch/dalle_pytorch.py

Lines 407 to 409 in ce0c892

    
    filtered_logits = top_k(logits, thres = filter_thres)
    probs = F.softmax(filtered_logits / temperature, dim = -1)
    sample = torch.multinomial(probs, 1)
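
To make the source of randomness concrete, here is a pure-Python sketch of the same three steps (top-k filtering, temperature softmax, multinomial draw). It is an approximation for illustration, not the library code: it assumes top_k keeps the top (1 - thres) fraction of logits and masks the rest with -inf, and the logits are made up.

```python
import math
import random

def top_k_filter(logits, thres=0.5):
    """Keep the top (1 - thres) fraction of logits; set the rest to -inf."""
    k = max(int((1 - thres) * len(logits)), 1)
    cutoff = sorted(logits, reverse=True)[k - 1]
    return [x if x >= cutoff else -math.inf for x in logits]

def sample_token(logits, thres=0.5, temperature=1.0, rng=random):
    """Draw one token index from the filtered, temperature-scaled distribution."""
    filtered = top_k_filter(logits, thres)
    scaled = [x / temperature for x in filtered]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]   # masked entries get weight 0.0
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]   # made-up logits for four tokens
rng = random.Random(0)
samples = [sample_token(logits, rng=rng) for _ in range(10)]
print(samples)
```

Since the last step is a random draw, the same text can give different images on every run unless the random seed is fixed.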

@WormCoder
The captions are here: https://drive.google.com/file/d/1O_LtUP9sch09QH3s_EBAgLEctBQ5JBSJ/view

JimmyRaoUF Apr 21, 2021

Using dalle-pytorch = 0.7.2 fixed the issue with loading the Dalle model when I use cuda 10.1. The deepspeed is still a problem - it now throws other errors like 'Unable to JIT load the sparse_attn op due to it not being compatible due to hardware/software issue'. Then I looked up this issue in the DeepSpeed library and they say that 'Sparse Attention kernels are written in Triton and currently only work on Tesla V100; we will be soon upgrading to handle Ampere as well. However, it is not compatible with GeForce RTX.' I have an RTX 2080 GPU. Looks like I am out of luck in loading and playing with your model results - at least on my machine; I have to go to Google Colab then. Not sure why they would not make DeepSpeed compatible with GeForce.

I am in the same boat. kobiso uploaded a new model in his repo without the requirement for DeepSpeed, but I am still facing this error: RuntimeError: Error(s) in loading state_dict for DALLE:
Missing key(s) in state_dict:
Did you solve this problem? @kswamy15

kswamy15 Apr 22, 2021

I will try his new model without the requirement for DeepSpeed this weekend and will let you know how it went.

JimmyRaoUF Apr 22, 2021

I will try his new model without the requirement for DeepSpeed this weekend and will let you know how it went.

I tried pip install dalle-pytorch==0.7.2 and restarted. Now the keys match. @kswamy15

gustmd0121 Nov 28, 2021

@kobiso Hello, thank you for your great work. I was wondering, is there any intuition behind using a vocabulary size of 8000? I am experimenting on other datasets and was wondering if there's any rule of thumb you are using, such as analyzing the frequency distribution of dataset tokens, etc... Thank you in advance!

yovizzle · 2021-04-04T12:38:03Z

yovizzle
Apr 4, 2021

Thank you @kobiso!!! Very much appreciated.

On Sat, 3 Apr 2021, 11:22 pm ByungSoo Ko wrote: @rom1504 I will check it next week! thanks for letting me know :)

0 replies

kobiso · 2021-04-21T14:45:06Z

kobiso
Apr 21, 2021
Author

Attention type ('full', 'axial_row', 'axial_col', 'conv_like') works

The experiment above was trained with attention type ('full', 'sparse'), where DeepSpeed sparse attention gives advantages in memory usage and training speed.
However, OpenAI's DALL-E uses row, column, and convolutional attention in the model, as below.
Thankfully, these attention types are implemented, as described in the README.
So, I ran an experiment with attention type ('full', 'axial_row', 'axial_col', 'conv_like').

Experimental setting

DALLE-pytorch version: 0.7.2
Attention type: ('full', 'axial_row', 'axial_col', 'conv_like')
Batch size: 32 * 7 (gpus)
Others are the same as Decent text-to-image generation results on CUB200 #131 (comment)
Pretrained model is in https://github.com/kobiso/DALLE-reproduction

Computational cost

Training speed: decreased by 38% compared to ('full', 'sparse')
Memory consumption: had to reduce the batch size from 110 ('full', 'sparse') to 32 ('full', 'axial_row', 'axial_col', 'conv_like') per GPU

Training log

Results

I feel like ('full', 'sparse') is a little better than ('full', 'axial_row', 'axial_col', 'conv_like') in generation performance.
But still, ('full', 'axial_row', 'axial_col', 'conv_like') does work 👍

7 replies

afiaka87 Apr 21, 2021

@kobiso any chance we could get you to modify your checkpoint to be compatible with the latest DALLE-pytorch? You're working without quite a few optimizations/bug fixes at this point.

Happy to help - you just need to change the names of your keys to match the ones used in latest if I'm not mistaken.

phymhan Jun 7, 2021

Hi @kobiso,

Thanks for the great work! I followed the hyperparameters you posted and tried to reproduce results on CUB200, however, the generation I got during training is often like this:

The command I used for training is

python train_dalle.py --image_text_folder dataset/cub200 --batch_size 64 --depth 8 --text_seq_len 100 --dim_head 64 --dim 256 --attn_types 'full,axial_row,axial_col,conv_like' --learning_rate 0.00045 --taming --epochs 200 --lr_decay

I train on a single A100 with batch size 64 and the default YttmTokenizer. Do you have any suggestions or comments? Thank you so much again!

kobiso Jun 11, 2021
Author

Can you share the training logs?

phymhan Jun 11, 2021

Thanks a lot for your reply!

Here is the wandb logs:
https://wandb.ai/ligongh/dalle_cub200_aws?workspace=user-ligongh

I have also trained from scratch a model using a custom yttm tokenizer with vocab size of 8000 but results are similar, losses are around 2.2 also.

kobiso Jun 14, 2021
Author

Hmm, I can't tell much from the loss graph...
I recommend not using gradient clipping and also trying a lower learning rate of 0.0001.
I am not sure what the problem is because I haven't tried training the model with a recent lucidrains/DALLE-pytorch version.

krrishdholakia · 2021-06-03T18:37:11Z

krrishdholakia
Jun 3, 2021

Hi @kobiso, thanks for the work! I tried using the link provided: https://github.com/kobiso/DALLE-reproduction and it throws a 404 error. Any help would be appreciated!

2 replies

kobiso Jun 11, 2021
Author

@krrishdholakia I'm sorry. I had to make it private because of some issues.
Instead, you can try this repo to use pretrained models: https://github.com/robvanvolt/DALLE-models

robvanvolt Jun 11, 2021

You can use these models from https://github.com/robvanvolt/DALLE-models (or any other model) with the script rom created here: #288 (comment)

It works fantastically! Also, the models will be updated soon, providing much better results! ;-):)

wintersurvival · 2021-07-14T03:59:08Z

wintersurvival
Jul 14, 2021

2 replies

phymhan Jul 14, 2021

I used preprocessed data in AttnGAN (StackGAN also provides preprocessed texts).

wintersurvival Jul 15, 2021

Thanks. Solved the problem. The text file exists in birds.zip.

wintersurvival · 2021-07-21T09:28:57Z

wintersurvival
Jul 21, 2021

@kobiso @lucidrains Thank you for your great work!
Could you help me with this problem:
Using 8 GPUs with Horovod, GPU 0 has 7 additional processes, making GPU 0 use much more memory than the other GPUs. What causes that? Did I do anything wrong?

0 replies

wintersurvival · 2021-08-11T03:18:33Z

wintersurvival
Aug 11, 2021

@kobiso Thank you for sharing!
The training loss of my reproduction with your config is much lower than the loss in the curve above, yet the generated images are not as good as those above.
What could be the reason?

0 replies

appliedml42 · 2021-09-07T15:32:51Z

appliedml42
Sep 7, 2021

@kobiso I have been trying to replicate your results. I am wondering whether my configuration needs adjustment or whether I just need to train for a very long time. After how many steps in your setup did you start seeing decent results during training?

2 replies

SerezD Sep 8, 2021

Hi! Can you please share details of your conf ?

appliedml42 Sep 9, 2021

You can check out my configuration here: https://wandb.ai/appliedml85/storyteller/runs/3ph49b5r?workspace=user-appliedml85. Still trying to reproduce the great results from this discussion.


Replies: 15 comments · 72 replies
