Following Model Cards for Model Reporting (Mitchell et al.) and Lessons from Archives (Jo & Gebru), we're providing some information about the discrete VAE (dVAE) that was used to train DALL·E.
The dVAE was developed by researchers at OpenAI to reduce the memory footprint of the transformer trained on the text-to-image generation task. The details involved in training the dVAE are described in the paper. This model card describes the first version of the model, released in February 2021. The model consists of a convolutional encoder and decoder whose architectures are described here and here, respectively. For questions or comments about the models or the code release, please file a GitHub issue.
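As a rough illustration of how the released encoder and decoder fit together, the sketch below encodes a 256×256 image into a 32×32 grid of discrete codes and decodes it back. It assumes the `load_model`, `map_pixels`, and `unmap_pixels` helpers and the `encoder.pkl`/`decoder.pkl` checkpoints from the accompanying code release; treat the exact names and URLs as illustrative and check the repository's usage example if they differ.

```python
import torch
import torch.nn.functional as F

# Helpers and checkpoint URLs assumed to match the code release's usage example.
from dall_e import load_model, map_pixels, unmap_pixels

dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")
enc = load_model("https://cdn.openai.com/dall-e/encoder.pkl", dev)  # convolutional encoder
dec = load_model("https://cdn.openai.com/dall-e/decoder.pkl", dev)  # convolutional decoder

# x: a preprocessed image tensor of shape [1, 3, 256, 256] with values in [0, 1].
# A random tensor stands in for a real image here, purely for illustration.
x = map_pixels(torch.rand(1, 3, 256, 256, device=dev))

# Encode: logits over the codebook, shape [1, 8192, 32, 32]; taking the argmax
# yields a 32x32 grid of discrete codes from the 8192-entry vocabulary.
z_logits = enc(x)
z = torch.argmax(z_logits, dim=1)

# Decode: one-hot the codes and run the decoder to reconstruct the image.
z_one_hot = F.one_hot(z, num_classes=z_logits.shape[1]).permute(0, 3, 1, 2).float()
x_stats = dec(z_one_hot).float()
x_rec = unmap_pixels(torch.sigmoid(x_stats[:, :3]))
```

The argmax gives deterministic codes suitable for inspection; the relaxation used while training the dVAE itself is described in the paper.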
The model is intended to be used by others to train their own generative models.
This model is inappropriate for high-fidelity image processing applications. We also do not recommend its use as a general-purpose image compressor.
The model was trained on publicly available text-image pairs collected from the internet. This data consists partly of Conceptual Captions and a filtered subset of YFCC100M. We used a subset of the filters described in Sharma et al. to construct this dataset; further details are described in our paper. We will not be releasing the dataset.
The heavy compression from the encoding process results in a noticeable loss of detail in the reconstructed images. This makes the model inappropriate for applications that require fine-grained details of the image to be preserved.
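For a back-of-the-envelope sense of how heavy that compression is (assuming the 256×256-pixel input resolution, 32×32 code grid, and 8192-entry vocabulary described in the paper):

```python
# Rough compression estimate for the dVAE bottleneck.
pixels = 256 * 256 * 3        # RGB values in the input image
codes = 32 * 32               # discrete codes after encoding
bits_per_pixel = 8            # 8-bit channels
bits_per_code = 13            # log2(8192) = 13 bits per vocabulary entry

input_bits = pixels * bits_per_pixel   # 1,572,864 bits
latent_bits = codes * bits_per_code    # 13,312 bits
print(f"compression factor ~{input_bits / latent_bits:.0f}x")  # roughly 118x
```

Each code summarizes roughly an 8×8 patch of pixels, so detail finer than that scale is largely discarded.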