Leveraging GPT-2 (or GPT-anything) to generate better prompts for Stable Diffusion 2.
The pipeline works as follows:
- We sample a number of images from the COCO dataset and use them as our training data.
- We pass these images to a frozen version of Meta AI's Detectron2 model, which gives us a JSON description of the objects in each picture (see the detection sketch after this list).
- We then use this JSON to generate a simplified string describing the image.
- We pass this string to GPT-2 together with an instruction prompt that explains how the string is structured and asks the model to produce an image-generation prompt (see the GPT-2 sketch below).
- The generated prompt is passed to a frozen Stable Diffusion 2 pipeline to produce an output image.
- Finally, we compare the input image and the output image with an SSIM loss and use it to fine-tune the weights of our GPT-2 instance (see the last sketch below).
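
A minimal sketch of the detection step, assuming a standard Detectron2 model-zoo config (`COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml`) and a simple "count per class" string format; both are illustrative choices, not necessarily the exact ones used in this project:

```python
import cv2
from collections import Counter
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data import MetadataCatalog
from detectron2.engine import DefaultPredictor

# Frozen, off-the-shelf Detectron2 detector (illustrative config choice).
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5
predictor = DefaultPredictor(cfg)

def image_to_simplified_string(image_path: str) -> str:
    """Run detection on one COCO image and flatten the result into a short string."""
    image = cv2.imread(image_path)
    instances = predictor(image)["instances"].to("cpu")
    class_names = MetadataCatalog.get(cfg.DATASETS.TRAIN[0]).thing_classes
    counts = Counter(class_names[i] for i in instances.pred_classes.tolist())
    # e.g. "2 person, 1 dog, 1 frisbee" -- the exact format is an assumption.
    return ", ".join(f"{n} {name}" for name, n in counts.items())
```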
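The GPT-2 step can be sketched with Hugging Face `transformers`. GPT-2 has no chat-style system prompt, so the "system prompt" is modeled here as a plain-text instruction prefix; the instruction wording and the sampling parameters below are assumptions:

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")  # the model being fine-tuned

# Instruction prefix explaining the string format and the task (hypothetical wording).
INSTRUCTION = (
    "The following line lists object counts detected in a photo, e.g. '2 person, 1 dog'. "
    "Write a rich, detailed prompt for an image-generation model that depicts this scene.\n"
    "Objects: {objects}\n"
    "Prompt:"
)

def string_to_sd_prompt(simplified: str) -> str:
    """Turn a simplified scene string into a Stable Diffusion prompt."""
    inputs = tokenizer(INSTRUCTION.format(objects=simplified), return_tensors="pt")
    out = gpt2.generate(
        **inputs,
        max_new_tokens=60,
        do_sample=True,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Keep only the newly generated continuation as the image-generation prompt.
    return tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip()
```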
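For the last two steps, a frozen `diffusers` Stable Diffusion 2 pipeline generates the image and SSIM (computed here with `torchmetrics`) scores it against the original COCO image. This is only a sketch of the forward pass; how the signal from this score is propagated back to GPT-2's weights through the discrete prompt is project-specific and not shown:

```python
import torch
from PIL import Image
from torchvision.transforms.functional import to_tensor, resize
from diffusers import StableDiffusionPipeline
from torchmetrics.image import StructuralSimilarityIndexMeasure

# Frozen Stable Diffusion 2 pipeline; only GPT-2 is trained.
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2").to(device)
for module in (pipe.unet, pipe.vae, pipe.text_encoder):
    module.requires_grad_(False)

ssim = StructuralSimilarityIndexMeasure(data_range=1.0)

def ssim_loss(prompt: str, original: Image.Image) -> torch.Tensor:
    """Generate an image from the GPT-2 prompt and compare it to the COCO original."""
    generated = pipe(prompt, num_inference_steps=30).images[0]
    size = [512, 512]  # common SD2 resolution; an assumption
    a = to_tensor(resize(original.convert("RGB"), size)).unsqueeze(0)
    b = to_tensor(resize(generated, size)).unsqueeze(0)
    return 1.0 - ssim(b, a)  # lower loss when the two images are more similar
```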
[Final Project for UCLA's COM SCI 263 - Natural Language Processing]