LayoutDETR

LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer

Ning Yu, Chia-Chih Chen, Zeyuan Chen, Rui Meng
Gang Wu, Paul Josel, Juan Carlos Niebles, Caiming Xiong, Ran Xu

Salesforce Research

arXiv 2023

paper | project page

Abstract

Graphic layout designs play an essential role in visual communication. Yet handcrafting layout designs is skill-demanding, time-consuming, and non-scalable to batch production. Generative models emerge to make design automation scalable but it remains non-trivial to produce designs that comply with designers' multimodal desires, i.e., constrained by background images and driven by foreground content. We propose LayoutDETR that inherits the high quality and realism from generative modeling, while reformulating content-aware requirements as a detection problem: we learn to detect in a background image the reasonable locations, scales, and spatial relations for multimodal foreground elements in a layout. Our solution sets a new state-of-the-art performance for layout generation on public benchmarks and on our newly-curated ad banner dataset. We integrate our solution into a graphical system that facilitates user studies, and show that users prefer our designs over baselines by significant margins.

Prerequisites

Linux
NVIDIA GPU + CUDA 11.3

To install the conda virtual environment, run

 conda env create -f environment.yaml
 conda activate layoutdetr

For training, download Up-DETR pretrained weights to pretrained/.

For inference and layout generation in the wild, build the Chrome-based text rendering environment by running

 apt-get update
 cp chromedriver /usr/bin/chromedriver
 ln -fs /usr/share/zoneinfo/America/Los_Angeles /etc/localtime
 DEBIAN_FRONTEND=noninteractive apt --assume-yes install ./google-chrome-stable_current_amd64.deb

Data preprocessing

Our ad banner dataset (14.7GB, 7,672 samples): part of the source images are filtered from the Pitt Image Ads Dataset, and the others are crawled from the Google image search engine with a variety of retailer brands as keywords. Download our dataset and unzip it to data/, which contains three subdirectories:

png_json_gt/ subdirectory contains:
- *.png files representing well-designed images with foreground elements superimposed on the background.
- Corresponding *.json files with the same file names as the *.png files, representing the layout ground truth of the foreground elements of each well-designed image. Each *.json file contains:
  - xyxy_word_fit key: A set of bounding box annotations in the form of [cy, cx, height, width], detected by our Salesforce Einstein OCR.
  - str key: Their text contents if any, also recognized by our Salesforce Einstein OCR.
  - label key: Their element categories annotated manually through Amazon Mechanical Turk. The interesting categories include {header, pre-header, post-header, body text, disclaimer / footnote, button, callout, logo}.
1x_inpainted_background_png/ subdirectory correspondingly contains a set of *_inpainted.png files representing the background-only images of the well-designed images. The subregions that were superimposed by foreground elements have been inpainted by the LaMa technique. These background images should be used for inference or evaluation only, not for training.
3x_inpainted_background_png/ subdirectory also contains a corresponding set of *_inpainted.png files representing the background-only images of the well-designed images. Here 2x extra random subregions are also inpainted, which keeps the generator from overfitting to inpainted subregions, as it could if we inpainted only the ground truth layouts. The augmented inpainting subregions serve as false positives: they are inpainted but are not ground truth layouts. We use these background images for training.
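
As a minimal sketch of how one ground-truth file could be read, the snippet below pairs up the three keys described above. It assumes that xyxy_word_fit, str, and label hold parallel lists with one entry per foreground element; that layout is an assumption, not something confirmed here, and load_layout is a hypothetical helper that is not part of this repository.

```python
import json
from pathlib import Path

# Element categories listed above.
CATEGORIES = {
    "header", "pre-header", "post-header", "body text",
    "disclaimer / footnote", "button", "callout", "logo",
}


def load_layout(json_path):
    """Return one dict per foreground element of a ground-truth *.json file.

    Assumption: the three keys hold parallel lists of equal length.
    """
    record = json.loads(Path(json_path).read_text(encoding="utf-8"))
    boxes = record["xyxy_word_fit"]
    texts = record["str"]
    labels = record["label"]
    if not (len(boxes) == len(texts) == len(labels)):
        raise ValueError(f"{json_path}: annotation lists differ in length")

    elements = []
    for box, text, label in zip(boxes, texts, labels):
        elements.append({
            "box": list(box),
            "text": text,
            "label": label,
            # Flag rather than reject labels outside the listed categories.
            "is_listed_category": label in CATEGORIES,
        })
    return elements
```

Calling load_layout on a file under png_json_gt/ would then give a list that is easy to filter by category before training or visualization.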

To preprocess the dataset into a format that is efficient for training, run

python dataset_tool.py \
--source=data/ads_banner_dataset/png_json_gt \
--dest=data/ads_banner_dataset/zip_3x_inpainted \
--inpaint-aug

where

--source indicates the source data directory path where you downloaded the raw dataset.
--dest indicates the preprocessed data directory path containing two files: train.zip and val.zip, which are split 9:1 from the source data.
--inpaint-aug indicates using 3x_inpainted_background_png/ with extra inpainting on the background instead of using 1x_inpainted_background_png/. Use this argument when preprocessing training data.
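
To make the 9:1 split concrete, here is a small sketch of a deterministic train/validation split over sample file names. It is only an illustration: split_train_val, its seed, and the shuffling strategy are hypothetical, and dataset_tool.py may assign samples to train.zip and val.zip differently.

```python
import random


def split_train_val(names, val_fraction=0.1, seed=0):
    """Shuffle sample names deterministically and split them into (train, val)."""
    ordered = sorted(names)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(ordered)
    n_val = round(len(ordered) * val_fraction)
    return ordered[n_val:], ordered[:n_val]


if __name__ == "__main__":
    train, val = split_train_val([f"{i:04d}.png" for i in range(100)])
    print(len(train), len(val))
```

Fixing the seed keeps the validation set stable across reruns, which matters when comparing checkpoints.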

Training

python train.py --gpus=8 --batch=16 \
--data=data/ads_banner_dataset/zip_3x_inpainted/train.zip \
--outdir=training-runs \
--metrics=layout_fid50k_train,layout_fid50k_val,fid50k_train,fid50k_val,overlap50k_alignment50k_layoutwise_iou50k_layoutwise_docsim50k_train,overlap50k_alignment50k_layoutwise_iou50k_layoutwise_docsim50k_val

where

--batch indicates the total batch size on all the GPUs.
--data indicates the preprocessed training data .zip file path.
--outdir indicates the output directory path of model checkpoints, result snapshots, config record file, log file, etc.
--metrics indicates the evaluation metrics measured for each model checkpoint during training, which can include layout FID, image FID, overlap penalty, misalignment penalty, layout-wise IoU, and layout-wise DocSim, etc. See more metric options in metrics/metric_main.py.
See the definitions and default settings of the other arguments in train.py.
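
Because --batch is the total across all GPUs, the example command above (--gpus=8 --batch=16) corresponds to 2 samples per GPU. The sketch below spells out that arithmetic. It assumes the total must divide evenly across GPUs, as is common in StyleGAN3-style training scripts; check train.py for the actual rule. per_gpu_batch is a hypothetical helper.

```python
def per_gpu_batch(total_batch, num_gpus):
    """Return the per-GPU batch size implied by a total batch size.

    Assumption: the total batch size must be a multiple of the GPU count.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if total_batch % num_gpus != 0:
        raise ValueError("total_batch must be a multiple of num_gpus")
    return total_batch // num_gpus


if __name__ == "__main__":
    print(per_gpu_batch(16, 8))
```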

Evaluation

Download the well-trained LayoutDETR model on our ad banner dataset from here (2.7GB) to checkpoints/.

python evaluate.py --gpus=8 --batch=16 \
--data=data/ads_banner_dataset/zip_1x_inpainted/val.zip \
--outdir=evaluation \
--ckpt=checkpoints/layoutdetr_ad_banner.pkl \
--metrics=layout_fid50k_val,fid50k_val,overlap50k_alignment50k_layoutwise_iou50k_layoutwise_docsim50k_val,rendering_val

where

--ckpt indicates the path of the well-trained generator .pkl file.
--metrics=rendering_val renders texts on background images given the generated layouts.

Layout generation in the wild

python generate.py \
--ckpt=checkpoints/layoutdetr_ad_banner.pkl \
--bg='examples/Lumber 2 [header]EVERYTHING 10% OFF[body text]Friends & Family Savings Event[button]SHOP NOW[disclaimer]CODE FRIEND10.jpg' \
--bg-preprocessing=256 \
--strings='EVERYTHING 10% OFF|Friends & Family Savings Event|SHOP NOW|CODE FRIEND10' \
--string-labels='header|body text|button|disclaimer / footnote' \
--outfile='examples/output/Lumber 2' \
--out-postprocessing=horizontal_center_aligned

where

--ckpt indicates the path of the well-trained generator .pkl file.
--bg indicates the provided background image file path.
--bg-preprocessing indicates the preprocessing operation to the background image. The default is none, meaning no preprocessing.
--strings indicates the ads text strings, the bboxes of which will be generated on top of the background image. Multiple (<10) strings are separated by |.
--string-labels indicates the ads text string labels, selected from {header, pre-header, post-header, body text, disclaimer / footnote, button, callout, logo}. Multiple (<10) labels are separated by |.
--outfile indicates the output file path and name (without extension).
--out-postprocessing indicates the postprocessing operation applied to the output bbox parameters so as to guarantee alignment and remove overlaps. The operation can be selected from {none, horizontal_center_aligned, horizontal_left_aligned}. The default is none, meaning no postprocessing.
The values of the generated bbox parameters [cy, cx, h, w] can be read from the variable bbox_fake (in the shape of BxNx4, where B=1 and N is the number of strings in one ad) in generate.py.
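
The argument conventions above can be sketched in a few lines: splitting --strings and --string-labels on |, and turning one [cy, cx, h, w] box into pixel corners. Two assumptions are made that this document does not confirm: each string has exactly one label, and the box values are normalized to [0, 1] relative to the background image size. parse_text_args and cycxhw_to_corners are hypothetical helpers, not functions from generate.py.

```python
LABELS = {
    "header", "pre-header", "post-header", "body text",
    "disclaimer / footnote", "button", "callout", "logo",
}


def parse_text_args(strings, string_labels, sep="|", max_items=9):
    """Pair each text string with its label, following the <10 items rule."""
    texts = strings.split(sep)
    labels = string_labels.split(sep)
    if len(texts) != len(labels):  # assumption: one label per string
        raise ValueError("number of strings and labels differ")
    if len(texts) > max_items:
        raise ValueError(f"at most {max_items} strings are supported")
    unknown = [label for label in labels if label not in LABELS]
    if unknown:
        raise ValueError(f"unknown labels: {unknown}")
    return list(zip(texts, labels))


def cycxhw_to_corners(box, image_width, image_height):
    """Convert [cy, cx, h, w] to pixel (left, top, right, bottom).

    Assumption: the four values are normalized to [0, 1].
    """
    cy, cx, h, w = box
    left = (cx - w / 2) * image_width
    top = (cy - h / 2) * image_height
    right = (cx + w / 2) * image_width
    bottom = (cy + h / 2) * image_height
    return left, top, right, bottom


if __name__ == "__main__":
    pairs = parse_text_args(
        "EVERYTHING 10% OFF|Friends & Family Savings Event|SHOP NOW|CODE FRIEND10",
        "header|body text|button|disclaimer / footnote",
    )
    print(pairs)
    print(cycxhw_to_corners([0.5, 0.5, 0.5, 0.25], 200, 100))
```

If bbox_fake turns out to be in pixel units already, the multiplication by the image size should be dropped.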

Citation

@article{yu2023layoutdetr,
	title={LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer},
	author={Yu, Ning and Chen, Chia-Chih and Chen, Zeyuan and Meng, Rui and Wu, Gang and Josel, Paul and Niebles, Juan Carlos and Xiong, Caiming and Xu, Ran},
	journal={arXiv preprint arXiv:2212.09877},
	year={2023}
}

Acknowledgement

We thank Abigail Kutruff, Brian Brechbuhl, Elham Etemad, and Amrutha Krishnan from Salesforce for constructive advice.
We express gratitude to StyleGAN3, LayoutGAN++, DETR, Up-DETR, and BLIP, as our code was modified from their repositories.
We also acknowledge the data contribution of Pitt Image Ads Dataset and technical contribution of LaMa.
