Abstract

We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, e.g., edges, depth, segmentation, human pose, etc., with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models. © 2023 IEEE.
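The "zero convolutions" named in the abstract are convolution layers whose weights and biases are initialized to zero, so at the start of finetuning the trainable branch adds exactly nothing to the frozen model's features, and its influence grows only as training updates the parameters. A minimal NumPy sketch of a 1x1 zero convolution, assuming illustrative shapes and a function name not taken from the paper:

```python
import numpy as np

def zero_conv_1x1(x, weight, bias):
    # A 1x1 convolution is a per-pixel linear map over channels.
    # x: (C_in, H, W), weight: (C_out, C_in), bias: (C_out,)
    c_out = weight.shape[0]
    return np.tensordot(weight, x, axes=([1], [0])) + bias.reshape(c_out, 1, 1)

# Zero-initialized weights and bias: before any training step,
# the ControlNet branch contributes nothing to the locked backbone.
c_in, c_out = 4, 4
weight = np.zeros((c_out, c_in))
bias = np.zeros(c_out)
feature = np.random.randn(c_in, 8, 8)
out = zero_conv_1x1(feature, weight, bias)
print(np.allclose(out, 0.0))  # True: the branch starts as an identity-preserving no-op
```

Because the added branch's output starts at zero, the combined model initially reproduces the pretrained model exactly, which is why the abstract describes the scheme as protecting finetuning from harmful noise.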

Keywords

Computer science, Convolution (computer science), Diffusion, Encoding (memory), Artificial intelligence, Image (mathematics), Noise (video), Segmentation, Convolutional neural network, Set (abstract data type), Pattern recognition (psychology), Zero (linguistics), Artificial neural network, Image segmentation, Algorithm, Computer vision, Physics


Publication Info

Year: 2023
Type: article
Pages: 3813-3824
Citations: 2649
Access: Closed


Citation Metrics

OpenAlex: 2649
Influential: 881
CrossRef: 2929

Cite This

Lvmin Zhang, Anyi Rao, Maneesh Agrawala (2023). Adding Conditional Control to Text-to-Image Diffusion Models. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 3813-3824. https://doi.org/10.1109/iccv51070.2023.00355

Identifiers

DOI
10.1109/iccv51070.2023.00355
arXiv
2302.05543
