Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

Abstract

Most recent semantic segmentation methods adopt a fully-convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract/semantic visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have been focused on increasing the receptive field, through either dilated/atrous convolutions or inserting attention modules. However, the encoder-decoder based FCN architecture remains unchanged. In this paper, we aim to provide an alternative perspective by treating semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer (i.e., without convolution and resolution reduction) to encode an image as a sequence of patches. With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR). Extensive experiments show that SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes. Particularly, we achieve the first position in the highly competitive ADE20K test server leaderboard on the day of submission.
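The phrase "encode an image as a sequence of patches" can be illustrated with a minimal sketch. It covers only the patch-flattening step and is not the authors' implementation: the function name `image_to_patch_sequence`, the nested-list image layout, and the patch size of 2 are assumptions made for this example. The abstract does not state a patch size, and the transformer layers and decoder are omitted entirely.

```python
from typing import List

# Assumed layout for this sketch: image[row][col] is a list of channel values.
Image = List[List[List[float]]]


def image_to_patch_sequence(image: Image, patch_size: int) -> List[List[float]]:
    """Flatten an H x W x C image into a row-major sequence of patch vectors.

    Hypothetical helper for illustration; each output vector has
    patch_size * patch_size * C entries.
    """
    height = len(image)
    width = len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")

    sequence = []
    # Visit patches left to right, top to bottom.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            # Concatenate the channel values of every pixel in the patch.
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    patch.extend(image[row][col])
            sequence.append(patch)
    return sequence


# Toy 4 x 4 single-channel image whose pixel value equals its row-major index.
toy_image = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
tokens = image_to_patch_sequence(toy_image, patch_size=2)
print(len(tokens), len(tokens[0]))
```

With these toy inputs the 4 x 4 image becomes a sequence of four vectors of length four. In a transformer encoder such vectors would typically be linearly projected and combined with position information before the attention layers, and the sequence length would stay the same from layer to layer, which is consistent with the abstract's "without ... resolution reduction".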

Keywords

Computer science, Segmentation, Encoder, Transformer, Artificial intelligence, Pascal (unit), Image segmentation, Computer vision, Pattern recognition (psychology), Engineering

Related Publications

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Ze Liu, Yutong Lin, Yue Cao +5 more

This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer ...

2021 2021 IEEE/CVF International Conferenc... 25813 citations

Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation

Liang-Chieh Chen, Yukun Zhu, George Papandreou +2 more

2018 Lecture notes in computer science 13300 citations

UNet++: Redesigning Skip Connections to Exploit Multiscale Features in Image Segmentation

Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh +1 more

The state-of-the-art models for medical image segmentation are variants of U-Net and fully convolutional networks (FCN). Despite their success, these models have two limitations...

2019 IEEE Transactions on Medical Imaging 3567 citations

Dual Attention Network for Scene Segmentation

Jun Fu, Jing Liu, Haijie Tian +4 more

In this paper, we address the scene segmentation task by capturing rich contextual dependencies based on the self-attention mechanism. Unlike previous works that capture context...

2019 6497 citations

UNet++: A Nested U-Net Architecture for Medical Image Segmentation

Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh +1 more

2018 Lecture notes in computer science 7871 citations

Publication Info

Year: 2021
Type: article
Citations: 3257
Access: Closed

Cite This

APA Style

Sixiao Zheng, Jiachen Lu, Hengshuang Zhao et al. (2021). Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. https://doi.org/10.1109/cvpr46437.2021.00681

Identifiers

DOI: 10.1109/cvpr46437.2021.00681