
More "OpenAI Blog Post" Training | Depth 32 | Heads 8 | LR 5e-4 #86

Closed

afiaka87 opened this issue Mar 15, 2021 · 31 comments

Comments

@afiaka87 (Contributor) commented Mar 15, 2021

Edit: Moved to discussions: #106

Hey, all. Some of you might know I'm practicing and learning about machine learning with dalle-pytorch and a dataset consisting of the images OpenAI presented in the DALL-E blog post. I honestly don't have the money to train on this whole dataset.

Edit: this is no longer true. Using the 1024 VQGAN from the "Taming Transformers" research, it's now quite possible to train on a full dataset of 1,000,000 image-text pairs, and I'm doing just that. I hope to have it finished in about a week. I assume someone else will release a dalle-pytorch model trained properly on COCO and other image sets before then, but if they don't, check here for updates.

Anyway, it ran for ~36,000 steps. As you can see, it... still really likes mannequins. I'm considering removing them from the dataset. But you'll also notice that the network has actually developed a decent idea of the general colors that belong with each type of prompt.

Some Samples from Near the End of Training

[results image]

Every Text-Image Reconstruction

https://wandb.ai/afiaka87/dalle_pytorch_live_training/reports/dalle-pytorch-Test-Run-2--Vmlldzo1MzM5MjQ

Deliverables (my train_dalle.py)

https://gist.github.com/afiaka87/850fb3cc48edde8a7ed4cb1ce53b6bd2

This has some code in it that actually manages to deal with truncated images via try/except. Apparently detecting a corrupted PNG is harder than P vs. NP: PIL's verify() function doesn't catch all of them, and Python's built-in imghdr library doesn't catch all of them either. So you just sort of catch OSError and return an item further along. Works well enough.
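For anyone curious, the pattern looks roughly like this (a minimal sketch, not the exact code from the gist; the class name and folder layout are illustrative):

from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset

class ImageTextDataset(Dataset):
    def __init__(self, folder, transform=None):
        self.image_paths = sorted(Path(folder).glob('*.png'))
        self.transform = transform

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, index):
        path = self.image_paths[index]
        try:
            # .convert() forces a full decode, so truncated files fail here rather than later
            image = Image.open(path).convert('RGB')
        except OSError:
            # PIL raises OSError for (most) corrupt/truncated files; skip to a neighbouring item
            return self.__getitem__((index + 1) % len(self))
        text = path.with_suffix('.txt').read_text()
        if self.transform is not None:
            image = self.transform(image)
        return text, image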

Parameters

SHUFFLE = True
EPOCHS = 28 # This wound up being less than a single epoch, of course. 
BATCH_SIZE = 16
LEARNING_RATE = 0.0005 # I found this learning rate to be more suitable than 0.0003 in my hyperparameter sweep post
GRAD_CLIP_NORM = 0.5
DEPTH = 32
HEADS = 8
MODEL_DIM = 512
TEXT_SEQ_LEN = 256
DIM_HEAD = 64
REVERSIBLE = True
ATTN_TYPES = ('full',)
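For context, here's roughly how those settings plug into the DALLE constructor in dalle-pytorch (a sketch rather than the full train_dalle.py; the num_text_tokens value and the choice of VQGanVAE1024 are assumptions, so check them against your own setup and installed version):

import torch
from dalle_pytorch import DALLE, VQGanVAE1024

vae = VQGanVAE1024()  # the pretrained 1024-codebook VQGAN from "Taming Transformers"

dalle = DALLE(
    dim = MODEL_DIM,              # 512
    vae = vae,
    num_text_tokens = 10000,      # depends on your tokenizer / vocabulary
    text_seq_len = TEXT_SEQ_LEN,  # 256
    depth = DEPTH,                # 32
    heads = HEADS,                # 8
    dim_head = DIM_HEAD,          # 64
    reversible = REVERSIBLE,      # True
    attn_types = ATTN_TYPES,      # ('full',)
)

# one dummy step, just to show the call signature
text = torch.randint(0, 10000, (BATCH_SIZE, TEXT_SEQ_LEN))
images = torch.randn(BATCH_SIZE, 3, 256, 256)
loss = dalle(text, images, return_loss = True)
loss.backward()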

Dataset Description

#61 (comment)

Just for more info on the dataset itself: it is roughly 1,100,000 256x256 image-text pairs that were generated by OpenAI's DALL-E. They presented roughly ~30k unique text prompts, of which they posted the top 32 of 512 generations each on https://openai.com/blog/dall-e/. Many images were corrupt, and not every prompt has a full 32 examples, but the total number of images winds up being about 1.1 million. If you look at many of the examples on that page, you'll see that DALL-E (in that form at least) can and will make mistakes. Those mistakes are also in this dataset. Anyway, I'm just messing around, having fun training and whatnot. This is definitely not going to produce a good model or anything.

There are also a large number of images in the dataset which are intended to be used with the "mask" feature. I don't know if that's possible yet in DALLE-pytorch, though. Anyway, that can't be helping much.

@afiaka87 (Contributor, Author) commented Mar 15, 2021

@lucidrains By the way, I've been going through the wandb.ai docs and found some nice extras you can add to train_dalle.py that will give you live updates on the transformer itself:

config = wandb.config
config.depth = DEPTH
config.heads = HEADS
config.dim_head = DIM_HEAD
config.learning_rate = LEARNING_RATE
config.shuffle = SHUFFLE
config.resume = RESUME
config.batch_size = BATCH_SIZE
config.grad_clip_norm = GRAD_CLIP_NORM
config.reversible = REVERSIBLE
config.model_dim = MODEL_DIM
config.attn_types = ATTN_TYPES

wandb.init(project = PROJECT_NAME, resume = RESUME)

wandb.watch(dalle) # Logs gradient histograms to wandb as the model trains

In particular, that very last line is all you actually need to add. But attaching all the hyperparameters to wandb.config the way I did also lets wandb track them properly, and it makes it easier to create hyperparameter sweeps from existing projects later on.
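And since sweeps came up: once the hyperparameters are in wandb.config, defining a sweep is just a dictionary plus two calls (a sketch with made-up values; train here stands in for whatever wraps your training loop):

import wandb

sweep_config = {
    'method': 'random',
    'metric': {'name': 'loss', 'goal': 'minimize'},
    'parameters': {
        'learning_rate': {'values': [3e-4, 5e-4, 1e-3]},
        'depth': {'values': [16, 32]},
        'heads': {'values': [8, 16]},
    },
}

sweep_id = wandb.sweep(sweep_config, project = PROJECT_NAME)
wandb.agent(sweep_id, function = train)  # runs train() once per sampled configuration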

@lucidrains (Owner)

@afiaka87 ohh got it! i'm circling back to DALL-E this week for some final instrumentations :) i'll be sure to add that! 🙏

@afiaka87 (Contributor, Author) commented Mar 15, 2021

@lucidrains Awesome, looking forward to it! Thanks for patching up big-sleep/deep-daze btw. I tried but I'm so distracted with this project now lol.

@lucidrains (Owner)

@afiaka87 yes, arguably getting a DALL-E model trained and released would be bigger than either big sleep or deep daze!

@afiaka87 afiaka87 reopened this Mar 15, 2021
@lucidrains (Owner)

thanks for doing this! it demonstrates that reversibility does work :)

@afiaka87 (Contributor, Author) commented Mar 15, 2021

@afiaka87 yes, arguably getting a DALL-E model trained and released would be bigger than either big sleep or deep daze!

@lucidrains For sure! I'm trying to be as open as I can about my training, code, results, etc., but I'm not seeing much else of that here. I'm aware it's prohibitively expensive for most, though, and I'm privileged to be able to run Depth=32 for a day or two. At any rate, looking forward to the 1024-token model from Germany! I know it's in there currently, but I was still having some trouble with it last I checked. All in due time.

@afiaka87 (Contributor, Author)

thanks for doing this! it demonstrates that reversibility does work :)

There should be system usage info in the graphs on wandb.ai, but yeah, it does what it says on the label, lol. You definitely trade time for space. But that whole training session never went above 16 GiB of VRAM, so at least people can use Colab!

@lucidrains (Owner)

@afiaka87 great to know! also, do log the issue with the VQ-GAN VAE and i'll be sure to fix it this week. It seems to be working on my end, but I haven't tried testing it from a fresh install

@afiaka87 (Contributor, Author)

@lucidrains One last thing: the "image masking" feature is used pretty thoroughly in this dataset, and they even include the image used for the mask and everything. Let me know as soon as that feature is implemented, as I would love to use those as a baseline for it.

@lucidrains (Owner)

@afiaka87 is that the feature where they have half an image and have it complete the other half?

@afiaka87 (Contributor, Author) commented Mar 15, 2021

@lucidrains Yes. The "The exact same cat on the top {insert style qualifier here} on the bottom." style ones. They're passing the top half in, as well as a prompt that acknowledges both pictures, and presumably forcing the top half to stay the same while it trains.

@lucidrains (Owner)

@afiaka87 yup, i can build that :)

@afiaka87 (Contributor, Author)

Great, let me know ASAP. The zero-shot style transfer stuff is so cool to me.

@robvanvolt (Contributor)

@afiaka87 yes, arguably getting a DALL-E model trained and released would be bigger than either big sleep or deep daze!

@lucidrains For sure! I'm trying to be as open as I can about my training, code, results, etc., but I'm not seeing much else of that here. I'm aware it's prohibitively expensive for most, though, and I'm privileged to be able to run Depth=32 for a day or two. At any rate, looking forward to the 1024-token model from Germany! I know it's in there currently, but I was still having some trouble with it last I checked. All in due time.

Agreed! Really awesome work by lucidrains in trying to replicate such an awesome tool as DALL-E! If only we could collaborate in a more efficient way - somewhat like a blockchain, where a few people improve the DALL-E model, the best version gets chosen after two days, gets redistributed, and a new search for better optimization begins... I think your hyperparameter sweep is a great step forward, @afiaka87! I will have my big system running in a week, so I hope to contribute in a more significant way then!

By the way, the Open Images V6 dataset (https://storage.googleapis.com/openimages/web/download.html) has "localized narratives", which might be a perfect fit for training DALL-E! Maybe I will generate a downsampled version (256x256 px) with captions in the format DALL-E requires; that would speed up the search for a training dataset and could improve collaboration.

@afiaka87 (Contributor, Author)

@robvanvolt Yep, that's a perfect dataset. I found this dataloader:

https://github.com/google/localized-narratives/blob/master/localized_narratives.py

And the downloader for the images:
https://raw.githubusercontent.com/openimages/dataset/master/downloader.py

You should be able to modify the DataLoader to load the correct image for a given localized narrative somewhat easily. This would also lend itself well to Weights & Biases artifacts (you just map URLs to things, and it downloads and caches them for you, pinning versions if things change).
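On the artifacts point, the flow is roughly this (the project name, artifact name, and bucket URL are all made up; see the W&B artifacts docs for the real details):

import wandb

run = wandb.init(project = 'dalle_pytorch_datasets')  # hypothetical project

# Publish: track the dataset by reference instead of uploading the files themselves
artifact = wandb.Artifact('open-images-v6-256px', type = 'dataset')
artifact.add_reference('gs://your-bucket/open-images-v6-256px')  # placeholder URL
run.log_artifact(artifact)

# Consume: a later run downloads (and caches) whatever the reference points to
dataset = run.use_artifact('open-images-v6-256px:latest')
local_dir = dataset.download()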

Let me know if you get started on this and need any help. I think this would produce a great result!

@afiaka87 (Contributor, Author) commented Mar 17, 2021

@afiaka87 yes, arguably getting a DALL-E model trained and released would be bigger than either big sleep or deep daze!

@lucidrains For sure! I'm trying to be as open as I can about my training, code, results, etc., but I'm not seeing much else of that here. I'm aware it's prohibitively expensive for most, though, and I'm privileged to be able to run Depth=32 for a day or two. At any rate, looking forward to the 1024-token model from Germany! I know it's in there currently, but I was still having some trouble with it last I checked. All in due time.

Agreed! Really awesome work by lucidrains in trying to replicate such an awesome tool as DALL-E! If only we could collaborate in a more efficient way - somewhat like a blockchain, where a few people improve the DALL-E model, the best version gets chosen after two days, gets redistributed, and a new search for better optimization begins... I think your hyperparameter sweep is a great step forward, @afiaka87! I will have my big system running in a week, so I hope to contribute in a more significant way then!

By the way, the Open Images V6 dataset (https://storage.googleapis.com/openimages/web/download.html) has "localized narratives", which might be a perfect fit for training DALL-E! Maybe I will generate a downsampled version (256x256 px) with captions in the format DALL-E requires; that would speed up the search for a training dataset and could improve collaboration.

I went ahead and downloaded all 500,000 of their images with "localized annotations". I'm training currently! The download is not for the faint of heart, though; it winds up being 169 GiB of data. Anyway, I can at least share the proper structure for the "*.txt" files as well as the "file_ids.txt" list of image IDs to download.

wget https://www.dropbox.com/s/3s0saz480hlg651/ids_to_download.txt
wget https://www.dropbox.com/s/ni95in1k7wpetso/captions.tar.gz # contains structure for localized annotations. Plop this folder next to the folder you put your images in.
tar -xf captions.tar.gz -C ~/project_name/captions

@Jinglei5

@afiaka87 yes, arguably getting a DALL-E model trained and released would be bigger than either big sleep or deep daze!

@lucidrains For sure! I'm trying to be as open as I can about my training, code, results, etc., but I'm not seeing much else of that here. I'm aware it's prohibitively expensive for most, though, and I'm privileged to be able to run Depth=32 for a day or two. At any rate, looking forward to the 1024-token model from Germany! I know it's in there currently, but I was still having some trouble with it last I checked. All in due time.

Agreed! Really awesome work by lucidrains in trying to replicate such an awesome tool as DALL-E! If only we could collaborate in a more efficient way - somewhat like a blockchain, where a few people improve the DALL-E model, the best version gets chosen after two days, gets redistributed, and a new search for better optimization begins... I think your hyperparameter sweep is a great step forward, @afiaka87! I will have my big system running in a week, so I hope to contribute in a more significant way then!
By the way, the Open Images V6 dataset (https://storage.googleapis.com/openimages/web/download.html) has "localized narratives", which might be a perfect fit for training DALL-E! Maybe I will generate a downsampled version (256x256 px) with captions in the format DALL-E requires; that would speed up the search for a training dataset and could improve collaboration.

I went ahead and downloaded all 500,000 of their images with "localized annotations". I'm training currently! The download is not for the faint of heart, though; it winds up being 169 GiB of data. Anyway, I can at least share the proper structure for the "*.txt" files as well as the "file_ids.txt" list of image IDs to download.

wget https://www.dropbox.com/s/3s0saz480hlg651/ids_to_download.txt
wget https://www.dropbox.com/s/ni95in1k7wpetso/captions.tar.gz # contains structure for localized annotations. Plop this folder next to the folder you put your images in.
tar -xf captions.tar.gz -C ~/project_name/captions

Thanks a lot! However, I could not download captions.tar.gz from your Dropbox (maybe the link is broken, since ids_to_download.txt downloads fine). I wonder how you reorganized the captions from the annotations. Did you use the class name as the caption of the image? Thanks again!

@afiaka87 (Contributor, Author)

@Jinglei5 Hm, I'll see if I can fix that. Unfortunately, my internet has just gone out halfway through training 🙄. I'm on my phone till it's back up, so it may be a bit.

@afiaka87 (Contributor, Author)

Hm, Dropbox is insisting that I've set that file to be publicly shared. Would you mind trying again with this?

wget https://www.dropbox.com/s/ni95in1k7wpetso/captions.tar.gz?dl=0

You'll have to rename the file, as it will include the ?dl=0 bit, but that's the only thing I can think of. If that still doesn't work, I'll host it elsewhere.

@Jinglei5 As for how I reorganized the captions: the current DataLoader literally just expects every single unique .png in your folder to have a correspondingly named .txt file that contains its text descriptions. If you go to the "localized annotations" page, you'll find a .jsonl file containing a mapping of each text phrase to an image ID. The rest is just some Python scripting to create a bunch of files with the same names as your images and fill them with the correct text descriptions.
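A minimal sketch of that scripting step, assuming the localized-narratives .jsonl uses image_id and caption fields (worth double-checking against the file itself):

import json
from pathlib import Path

jsonl_path = Path('open_images_train_v6_captions.jsonl')  # the localized-narratives annotations
captions_dir = Path('captions')                           # goes next to your images folder
captions_dir.mkdir(exist_ok=True)

with jsonl_path.open() as f:
    for line in f:
        record = json.loads(line)
        # one .txt per image, named like the image file; append so images with
        # several narratives end up with one caption per line
        with (captions_dir / f"{record['image_id']}.txt").open('a') as out:
            out.write(record['caption'].strip() + '\n')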

Here's my copy of the .jsonl file https://www.dropbox.com/s/9g6hbnyc1pek462/open_images_train_v6_captions.tar.xz?dl=0

Probably best to find the original again though. I'll be back with an edit.

@afiaka87 (Contributor, Author)

Just a general heads up, though: these captions aren't great. Because annotators could use the mouse to "tell the dataset" which part of the image they were referring to, they often leave out explicit directions, knowing that the information will be in there.

For instance:

"in this image there is a depiction in the bottom of this image and there are two persons standing on the right side to this , and there are some shelters in the background , and there are some trees as we can see in the bottom of this image , and there is a sky on the top of this image ."

or

"in the down side it is water . in the long back side there are trees and big buildings"

The captions not only contain pretty glaring grammar mistakes, but the information about location is also missing from these prompts, because the annotator (labeler? what do we call that?) knows that the computer is getting that info from their mouse.

@Jinglei5

Hm, Dropbox is insisting that I've set that file to be publicly shared. Would you mind trying again with this?

wget https://www.dropbox.com/s/ni95in1k7wpetso/captions.tar.gz?dl=0

You'll have to rename the file, as it will include the ?dl=0 bit, but that's the only thing I can think of. If that still doesn't work, I'll host it elsewhere.

@Jinglei5 As for how I reorganized the captions: the current DataLoader literally just expects every single unique .png in your folder to have a correspondingly named .txt file that contains its text descriptions. If you go to the "localized annotations" page, you'll find a .jsonl file containing a mapping of each text phrase to an image ID. The rest is just some Python scripting to create a bunch of files with the same names as your images and fill them with the correct text descriptions.

Here's my copy of the .jsonl file https://www.dropbox.com/s/9g6hbnyc1pek462/open_images_train_v6_captions.tar.xz?dl=0

Probably best to find the original again though. I'll be back with an edit.

It works this time! Thanks!
True, the captions contain phrases like 'In front of the picture' and 'we see'. Not sure whether they are useful or have side effects for the model.

@afiaka87 (Contributor, Author)

@Jinglei5 I'm gonna try mixing it with COCO2018 to see if it can at least get an idea of what a regular prompt might look like.

@afiaka87 (Contributor, Author)

@Jinglei5 I'm also currently in the (very lengthy) process of converting all of these to 256px JPEGs so I can actually move them around a bit. Do you have an existing workflow for that? Right now I'm just using ImageMagick's convert in a for loop.

@robvanvolt (Contributor) commented Mar 17, 2021

Hm, the annotations looked pretty solid at first glance, but we will see how the grammar mistakes and the missing orientation information get handled...

A few other interesting points:

DALL-E was trained with redundancy, e.g.

a neon sign that reads “backprop”. a neon sign that reads “backprop”. backprop neon sign

So this shouldn't be a problem, as I previously thought.

64 16 GB NVIDIA V100 GPUs, with a per-GPU batch size of 8, resulting in a total batch size of 512

Incredible computation power was used by OpenAI - it will be tough to optimize enough to get anywhere near OpenAI's results...

we created a dataset of a similar scale to JFT-300M by collecting 250 million text-image pairs from the internet.

Also, the dataset collected is insane - 250 million!

This dataset incorporates Conceptual Captions, the text-image pairs from Wikipedia, and a filtered subset of YFCC100M.

Wikipedia might be another solid source for text-image pairs.

Also, we might need to establish a better filter that we all use for training:

These filters include discarding instances whose captions are too short, are classified as non-English by the Python package cld3, or that consist primarily of boilerplate phrases such as “photographed on <date>”, where <date> matches various formats for dates that we found in the data.

And finally:

We also discard instances whose images have aspect ratios not in [1/2, 2]. If we were to use very tall or wide images, then the square crops used during training would likely exclude objects mentioned in the caption.

This might also be important, as I've seen a lot of images in different aspect ratios.
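If we do standardize on a filter, a rough sketch combining the caption checks and the aspect-ratio check quoted above might look like this (assuming the pycld3 package for language ID; the word-count threshold and date regex are guesses, not the paper's exact values):

import re

import cld3  # pycld3, for language identification
from PIL import Image

DATE_BOILERPLATE = re.compile(r'photographed on .+', re.IGNORECASE)  # crude stand-in for the <date> patterns

def keep_pair(image_path, caption, min_words=3):
    # 1. captions that are too short
    if len(caption.split()) < min_words:
        return False
    # 2. captions classified as non-English
    prediction = cld3.get_language(caption)
    if prediction is None or prediction.language != 'en' or not prediction.is_reliable:
        return False
    # 3. boilerplate captions like "photographed on <date>"
    if DATE_BOILERPLATE.fullmatch(caption.strip()):
        return False
    # 4. images with aspect ratios outside [1/2, 2]
    with Image.open(image_path) as img:
        width, height = img.size
    return 0.5 <= width / height <= 2.0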

On the other hand, we might have a better / faster transformer with the 1024 VQGAN, which might speed things up a little bit.

@Jinglei5

@Jinglei5 I'm also currently in the (very lengthy) process of converting all of these to 256px JPEGs so I can actually move them around a bit. Do you have an existing workflow for that? Right now I'm just using ImageMagick's convert in a for loop.

Sorry, I don't have a workflow. I just sampled 10,000 of them to feed the model directly for a trial right now. ><

@afiaka87 (Contributor, Author) commented Mar 17, 2021

On the other hand, we might have a better / faster transformer with the 1024 VQGAN, which might speed things up a little bit.

@robvanvolt Here are some early results from training on that dataset, by the way. I think we should definitely clean it up with the info from OpenAI.
https://wandb.ai/afiaka87/OpenImagesV6/reports/dalle-pytorch-OpenImagesV6-With-Localized-Annotations---Vmlldzo1MzgyMTU

After about ~15k iterations, I stopped training, added the COCO2018 dataset, and resumed from there for another ~6k steps.
https://wandb.ai/afiaka87/OpenImagesV6/reports/OpenImagesV6-COCO--Vmlldzo1MzgyNTI

@lucidrains @Jinglei5

@afiaka87 (Contributor, Author)

I'll probably make another post once I'm finished training. I think I'm ultimately going to go with a combination of all three datasets I've accrued so far: COCO2018, OpenImagesV6, and the ~1 million images from the OpenAI blog post. The size of OpenAI's dataset is definitely discouraging, though.

@robvanvolt I'm assuming there's a relatively easy way to get captioned images from Wikipedia, no? That's what I'm after next.

@afiaka87 (Contributor, Author)

@Jinglei5 I'm also currently in the (very lengthy) process of converting all of these to 256px JPEGs so I can actually move them around a bit. Do you have an existing workflow for that? Right now I'm just using ImageMagick's convert in a for loop.

Sorry, I don't have a workflow. I just sampled 10,000 of them to feed the model directly for a trial right now. ><

Ha, I do that as well. It's insane to me how many things just straight up break when you're dealing with lots of files.

It's all good though, I managed to figure it out:

find . -type f -name "*.jpg" | parallel mogrify -resize 256x {}
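If you'd rather stay in Python than shell out, a PIL version is straightforward (paths are illustrative; thumbnail bounds both dimensions at 256 while preserving aspect ratio, which is close to, but not identical to, -resize 256x):

from pathlib import Path
from PIL import Image

for path in Path('images').rglob('*.jpg'):
    try:
        with Image.open(path) as img:
            img.thumbnail((256, 256))   # resizes in place, keeping aspect ratio
            img.save(path, quality=90)
    except OSError:
        print(f'skipping corrupt file: {path}')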

@robvanvolt (Contributor)

I'll probably make another post once I'm finished training. I think I'm ultimately going to go with a combination of all three datasets I've accrued so far: COCO2018, OpenImagesV6, and the ~1 million images from the OpenAI blog post. The size of OpenAI's dataset is definitely discouraging, though.

@robvanvolt I'm assuming there's a relatively easy way to get captioned images from Wikipedia, no? That's what I'm after next.

Yes, it seems that on the 20th of March, 2021, there might be a solution that fits our needs exactly:

Wikipedia-based Image Text (WIT) Dataset is a large multimodal multilingual dataset. WIT is composed of a curated set of 37.6 million entity rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal machine learning models. [...] We are hoping to make the WIT dataset available for download by March 20th, 2021. (tentatively).

https://github.com/google-research-datasets/wit

@afiaka87 (Contributor, Author)

Moving these to discussions.

@afiaka87 (Contributor, Author)

I'll probably make another post once I'm finished training. I think I'm ultimately going to go with a combination of all three datasets I've accrued so far: COCO2018, OpenImagesV6, and the ~1 million images from the OpenAI blog post. The size of OpenAI's dataset is definitely discouraging, though.
@robvanvolt I'm assuming there's a relatively easy way to get captioned images from Wikipedia, no? That's what I'm after next.

Yes, it seems that on the 20th of March, 2021, there might be a solution that fits our needs exactly:

Wikipedia-based Image Text (WIT) Dataset is a large multimodal multilingual dataset. WIT is composed of a curated set of 37.6 million entity rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal machine learning models. [...] We are hoping to make the WIT dataset available for download by March 20th, 2021. (tentatively).

https://github.com/google-research-datasets/wit

funny how fast things change, eh?
