Generating images from object detection layouts via VQ-VAE and a Transformer

Released in: Modeling Image Composition for Complex Scene Generation



The authors present a method that achieves state-of-the-art results on challenging (few-shot) layout-to-image generation tasks by accurately modeling the textures, structures, and relationships contained in a complex scene. After compressing RGB images into patch tokens, they propose the Transformer with Focal Attention (TwFA) to explore object-to-object, object-to-patch, and patch-to-patch dependencies. Whereas existing CNN-based and Transformer-based generation models entangle modeling at the pixel/patch level and the object/patch level respectively, the proposed focal attention predicts the current patch token by attending only to the highly related tokens specified by the spatial layout, thereby achieving disambiguation during training. TwFA also greatly improves data efficiency during training, which allows the authors to propose the first few-shot complex scene generation strategy, built on a well-trained TwFA. Comprehensive experiments demonstrate the superiority of the proposed method, which significantly improves both quantitative metrics and qualitative visual realism relative to state-of-the-art CNN-based and Transformer-based methods.
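The core of focal attention is a layout-derived mask: an autoregressively generated patch token attends only to tokens it is related to through the spatial layout. The patch-to-patch case can be sketched as below; the function name, box format, and the rule that background patches attend only to other background patches are illustrative assumptions, not the authors' exact implementation (which also handles object tokens).

```python
import numpy as np

def focal_attention_mask(grid_h, grid_w, boxes):
    """Boolean mask where entry [q, k] is True iff patch token q may
    attend to patch token k (tokens in raster order).

    A patch attends only to patches sharing at least one layout box with
    it (background patches attend to background), combined with a causal
    constraint for autoregressive generation.

    boxes: list of (top, left, bottom, right) in patch-grid coordinates,
    bottom/right exclusive. This is a simplified sketch, not the paper's
    full formulation.
    """
    n = grid_h * grid_w
    # membership[p, b] = True if patch p lies inside bounding box b
    membership = np.zeros((n, len(boxes)), dtype=bool)
    for b, (top, left, bottom, right) in enumerate(boxes):
        for i in range(top, bottom):
            for j in range(left, right):
                membership[i * grid_w + j, b] = True
    background = ~membership.any(axis=1)
    # related[q, k]: q and k share a box, or both are background patches
    related = (membership @ membership.T) | np.outer(background, background)
    # raster-order causal constraint: attend only to already-generated tokens
    causal = np.tril(np.ones((n, n), dtype=bool))
    return related & causal
```

On a 2x2 grid with one box covering only the top-left patch, that patch attends solely to itself, while the three background patches attend to each other (subject to the causal order), illustrating how the mask disambiguates regions during training.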


Year Released

2022

Key Links & Stats


Modeling Image Composition for Complex Scene Generation

@InProceedings{Yang_2022_CVPR,
  author    = {Yang, Zuopeng and Liu, Daqing and Wang, Chaoyue and Yang, Jie and Tao, Dacheng},
  title     = {Modeling Image Composition for Complex Scene Generation},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2022},
  pages     = {7764-7773}
}

ML Tasks

  1. Image Generation

ML Platform

  1. Not Applicable


  1. Still Image


  1. General

CG Platform

  1. Not Applicable

Related organizations

Shanghai JiaoTong University

JD Explore Academy

The University of Sydney