by Bram Verhoef – Algorithm Architect at AXELERA AI
Summary – Convolutional neural networks (CNNs) still dominate today’s computer vision. Recently, however, networks based on transformer blocks have also been applied to typical computer vision tasks such as object classification, detection, and segmentation, attaining state-of-the-art results on standard benchmark datasets.
However, these vision transformers (ViTs) are usually pre-trained on extremely large datasets and may consist of billions of parameters, requiring teraflops of computing power. Furthermore, the cost of the self-attention mechanism inherent to classical transformers grows quadratically with the number of input tokens.
To mitigate some of the problems posed by ViTs, a new type of network based solely on multilayer perceptrons (MLPs) has recently been proposed. These vision MLPs (V-MLPs) shrug off classical self-attention but still achieve global processing through their fully connected layers.
In this blog post, we review the V-MLP literature, compare V-MLPs to CNNs and ViTs, and attempt to extract the ingredients that really matter for efficient and accurate deep learning-based computer vision.
In computer vision, CNNs have been the de facto standard networks for years. Early CNNs, like AlexNet [1] and VGGNet [2], consisted of a stack of convolutional layers, ultimately terminating in several large fully connected layers used for classification. Later, networks were made progressively more efficient by reducing the size of the classifying fully connected layers using global average pooling [3]. Furthermore, these more efficient networks, among other adjustments, reduce the spatial size of convolutional kernels [4, 5], employ bottleneck layers and depthwise convolutions [5, 6], and use compound scaling of the depth, width and resolution of the network [7]. These architectural improvements, together with several improved training methods [8] and larger datasets, have led to highly efficient and accurate CNNs for computer vision.
Despite their tremendous success, CNNs have their limitations. For example, their small kernels (e.g., 3×3) give rise to small receptive fields in the early layers of the network. This means that information processing in early convolutional layers is local and often insufficient to capture an object’s shape for classification, detection, segmentation, etc. This problem can be mitigated using deeper networks, increased strides, pooling layers, dilated convolutions, skip connections, etc., but these solutions either lose information or increase the computational cost. Another limitation of CNNs stems from the inductive bias induced by the weight sharing across the spatial dimensions of the input. Such weight sharing is modeled after early sensory cortices in the brain and (hence) is well adapted to efficiently capture natural image statistics. However, it also limits the model’s capacity and restricts the tasks to which CNNs can be applied.
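To make the receptive-field argument concrete, the sketch below computes the theoretical receptive field of a stack of convolutions with the standard recurrence: each layer adds (kernel − 1) × dilation input pixels, scaled by the product of the strides of the layers before it. The function name and the three example stacks are illustrative choices, not taken from any of the cited networks.

```python
def receptive_field(layers):
    """Theoretical receptive field (in input pixels) of a stack of conv layers.

    `layers` is a list of (kernel_size, stride, dilation) tuples, ordered from
    the input towards the output.
    """
    rf = 1    # a single input pixel sees only itself
    jump = 1  # distance, in input pixels, between neighbouring outputs
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf


three_plain = [(3, 1, 1)] * 3                       # three 3x3 convs, stride 1
three_strided = [(3, 2, 1)] * 3                     # three 3x3 convs, stride 2
three_dilated = [(3, 1, 1), (3, 1, 2), (3, 1, 4)]   # dilation 1, 2, 4

print(receptive_field(three_plain))    # 7
print(receptive_field(three_strided))  # 15
print(receptive_field(three_dilated))  # 15
```

Three plain 3×3 layers see only a 7×7 window. Strides and dilation widen the window to 15×15, but strides discard resolution and dilation skips pixels, which is the trade-off described above.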
Recently, there has been much research to solve the problems posed by CNNs by employing transformer blocks to encode and decode visual information. These so-called Vision Transformers (ViTs) are inspired by the success of transformer networks in Natural Language Processing (NLP) [9] and rely on global self-attention to encode global visual information in the early layers of the network. The original ViT [10] was isotropic (it maintained an equal-resolution-and-size representation across layers), permutation invariant, based entirely on fully connected layers, and reliant on global self-attention. As such, the ViT solved the above-mentioned problems related to CNNs by providing larger (dynamic) receptive fields in a network with less inductive bias.
This is exciting research, but it soon became clear that the ViT was hard to train, not competitive with CNNs when trained on relatively small datasets (e.g., ImageNet-1K (IM-1K) [11]), and computationally expensive as a result of the quadratic complexity of self-attention. Consequently, further studies sought to facilitate training. One approach was to use network distillation [12]. Another was to insert CNNs at the early stages of the network [13]. Further attempts to improve ViTs re-introduced inductive biases found in CNNs (e.g., using local self-attention [14] and hierarchical/pyramidal network structures [15]). There were also efforts to replace dot-product QKV-self-attention with alternatives [e.g., 16]. With these modifications now in place, vision transformers can compete with CNNs with respect to computational efficiency and accuracy, even when trained on relatively small datasets [see this blog post by Bert Moons for more discussion on ViTs].
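The quadratic cost of global self-attention can be illustrated with a back-of-the-envelope count. The sketch below counts only the multiply-accumulates for the token-by-token score matrix and the attention-weighted sum, ignoring projections and softmax. The 16×16 patch size follows the ViT title; the channel width of 768 and the helper names are assumptions made for illustration.

```python
def num_tokens(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    assert image_size % patch_size == 0, "image must tile exactly into patches"
    return (image_size // patch_size) ** 2


def attention_matrix_macs(n_tokens, dim):
    """Approximate multiply-accumulates for Q @ K^T plus (scores @ V).

    Each of the two products costs n_tokens * n_tokens * dim MACs.
    """
    return 2 * n_tokens * n_tokens * dim


small = num_tokens(224, 16)   # 14 x 14 patches
large = num_tokens(448, 16)   # 28 x 28 patches

print(small, attention_matrix_macs(small, 768))
print(large, attention_matrix_macs(large, 768))
print(attention_matrix_macs(large, 768) / attention_matrix_macs(small, 768))
```

Doubling the image side length quadruples the number of tokens (196 to 784) and multiplies this part of the attention cost by 16.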
Notwithstanding the success of recent vision transformers, several studies demonstrate that models building solely on multilayer perceptrons (MLPs) — so-called vision MLPs (V-MLPs) — can achieve surprisingly good results on typical computer vision tasks like object classification, detection and segmentation. These models aim for global spatial processing, but without the computationally complex self-attention. At the same time, these models are easy to scale (high model capacity) and seek to retain a model structure with low inductive bias, which makes them applicable to a wide range of tasks .
Like ViTs, V-MLPs first decompose the images into non-overlapping patches, called tokens, which form the input to a V-MLP block. A typical V-MLP block consists of a spatial MLP (token mixer) and a channel MLP (channel mixer), interleaved with (layer) normalization and complemented with residual connections. This is illustrated in Figure 1.
Figure 1. Typical V-MLP structure. Adapted from .
Here the spatial MLP captures the global correlations between tokens, while the channel MLP combines information across features. This can be formulated as follows:
Y = spatialMLP(LN(X)) + X
Z = channelMLP(LN(Y)) + Y
Here X is a matrix containing the tokens, Y consists of intermediate features, LN denotes layer normalization, and Z is the output feature of the block. In these equations, spatialMLP and channelMLP can be any nonlinear function represented by some type of MLP with an activation function (e.g., GELU).
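The block described above (a spatial MLP followed by a channel MLP, each preceded by layer normalization and wrapped in a residual connection) can be sketched in NumPy as follows. This is a minimal illustration and not the implementation of any specific cited model: the layer sizes, the omission of biases and learned normalization parameters, the tanh approximation of GELU, and all function names are assumptions.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize each token over its channels (no learned scale or shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def gelu(x):
    """tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def mlp(x, w1, w2):
    """Two-layer perceptron applied along the last axis of x (biases omitted)."""
    return gelu(x @ w1) @ w2


def patchify(image, patch):
    """Split an (H, W, C) image into non-overlapping patches, one row per token."""
    h, w, c = image.shape
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)            # (H/p, W/p, p, p, C)
    return x.reshape(-1, patch * patch * c)   # (N, p*p*C)


def vmlp_block(x, params):
    """x has shape (N tokens, D channels); the output has the same shape."""
    # Spatial (token) mixing: transpose so the MLP acts across tokens.
    y = mlp(layer_norm(x).T, params["s1"], params["s2"]).T + x
    # Channel mixing: the MLP acts on every token independently.
    z = mlp(layer_norm(y), params["c1"], params["c2"]) + y
    return z


rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))
tokens = patchify(image, patch=8)                        # (16, 192)
embed = rng.standard_normal((tokens.shape[1], 64)) * 0.02
x = tokens @ embed                                       # (16, 64)
n, d = x.shape
params = {
    "s1": rng.standard_normal((n, 2 * n)) * 0.02,   # token mixer, expands N -> 2N
    "s2": rng.standard_normal((2 * n, n)) * 0.02,
    "c1": rng.standard_normal((d, 4 * d)) * 0.02,   # channel mixer, expands D -> 4D
    "c2": rng.standard_normal((4 * d, d)) * 0.02,
}
z = vmlp_block(x, params)
print(x.shape, z.shape)
```

Note that the token-mixing weights `s1` and `s2` have shapes tied to the number of tokens N, so a block built this way only accepts inputs that produce exactly N patches.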
In practice, the channelMLP is often implemented by one or more 1×1 convolutions, and most of the innovation found in different studies lies in the structure of the spatialMLP submodule. And here is where history repeats itself. Where ViTs started as isotropic models with global spatial processing (e.g., ViT [10] or DeiT [12]), V-MLPs did so too (e.g., MLP-Mixer [17] or ResMLP [18]). Where recent ViTs improved their accuracy and performance on visual tasks by adhering to a hierarchical structure with local spatial processing (e.g., Swin Transformer [14] or NesT [19]), recent V-MLPs do so too (e.g., Hire-MLP [20] or S^2-MLPv2 [21]). These modifications made the models more computationally efficient (fewer parameters and FLOPs), easier to train and more accurate, especially when trained on relatively small datasets. Hence, over time both ViTs and V-MLPs re-introduced the inductive biases well known from CNNs.
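The statement that a channel MLP can be implemented with 1×1 convolutions can be checked numerically: a 1×1 convolution applies the same weight matrix at every spatial position, which is exactly a linear layer applied to each token independently. The sketch below uses a deliberately naive loop for the convolution; shapes and function names are illustrative assumptions.

```python
import numpy as np


def conv1x1(feature_map, weight):
    """Naive 1x1 convolution. feature_map: (H, W, C_in), weight: (C_in, C_out)."""
    h, w, _ = feature_map.shape
    out = np.empty((h, w, weight.shape[1]))
    for i in range(h):
        for j in range(w):
            # The same weights are applied at every spatial position.
            out[i, j] = feature_map[i, j] @ weight
    return out


def per_token_linear(tokens, weight):
    """Linear layer applied to each token independently. tokens: (N, C_in)."""
    return tokens @ weight


rng = np.random.default_rng(0)
fmap = rng.standard_normal((7, 7, 32))
weight = rng.standard_normal((32, 64))

as_conv = conv1x1(fmap, weight)
as_linear = per_token_linear(fmap.reshape(-1, 32), weight).reshape(7, 7, 64)
print(np.allclose(as_conv, as_linear))
```

The two results agree up to floating-point rounding, so the choice between them is one of implementation convenience rather than modelling.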
Due to their fully connected nature, V-MLPs are not permutation invariant and thus do not need the type of positional encoding frequently used in ViTs. However, one important drawback of pure V-MLPs is the fixed input resolution required for the spatialMLP submodule. This makes transfer to downstream tasks, such as object detection and segmentation, difficult. To mitigate this problem, some researchers have inserted convolutional layers or, similarly, bicubic interpolation layers, into the V-MLP (e.g., ConvMLP [22] or RaftMLP [23]). Of course, to some degree, this defeats the purpose of V-MLPs. Other studies have attempted to solve this problem using MLPs only (e.g., [20, 21, 30]), but the data-shuffling needed to formulate the problem as an MLP results in an operation that is very similar or even equivalent to some form of (grouped) convolution.
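The closing remark above, that data-shuffling tricks end up resembling a grouped convolution, can be illustrated with a simplified spatial shift. The sketch below moves four channel groups by one pixel in four directions (zero padded) and checks that the result equals a depthwise 3×3 convolution with one-hot kernels. It is loosely in the spirit of spatial-shift designs and does not reproduce any of the cited models; the group layout, padding choice, and function names are assumptions.

```python
import numpy as np


def shift_groups(x):
    """Shift four channel groups of an (H, W, C) map by one pixel, zero padded.

    Group 0 moves down, group 1 up, group 2 right, group 3 left.
    Assumes C is divisible by 4.
    """
    g = x.shape[2] // 4
    out = np.zeros_like(x)
    out[1:, :, 0 * g:1 * g] = x[:-1, :, 0 * g:1 * g]   # down
    out[:-1, :, 1 * g:2 * g] = x[1:, :, 1 * g:2 * g]   # up
    out[:, 1:, 2 * g:3 * g] = x[:, :-1, 2 * g:3 * g]   # right
    out[:, :-1, 3 * g:4 * g] = x[:, 1:, 3 * g:4 * g]   # left
    return out


def depthwise_conv3x3(x, kernels):
    """Naive depthwise 3x3 cross-correlation, zero padded. kernels: (3, 3, C)."""
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for di in range(3):
        for dj in range(3):
            out += padded[di:di + h, dj:dj + w, :] * kernels[di, dj, :]
    return out


def one_hot_shift_kernels(c):
    """Depthwise kernels that reproduce shift_groups for C = c channels."""
    g = c // 4
    k = np.zeros((3, 3, c))
    k[0, 1, 0 * g:1 * g] = 1.0   # read the pixel above: content moves down
    k[2, 1, 1 * g:2 * g] = 1.0   # read the pixel below: content moves up
    k[1, 0, 2 * g:3 * g] = 1.0   # read the pixel to the left: content moves right
    k[1, 2, 3 * g:4 * g] = 1.0   # read the pixel to the right: content moves left
    return k


rng = np.random.default_rng(0)
for h, w in [(8, 8), (12, 20)]:
    x = rng.standard_normal((h, w, 16))
    same = np.allclose(shift_groups(x), depthwise_conv3x3(x, one_hot_shift_kernels(16)))
    print((h, w), same)
```

Because the shift has no weights tied to the number of tokens, it runs unchanged at both resolutions, which is why such operations sidestep the fixed-resolution constraint while behaving like a convolution.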
See Table 1 for an overview of different V-MLPs. Note how some of the V-MLP models are very competitive with (or better than) state-of-the-art CNNs, e.g., ConvNeXt-B with 89M parameters, 45G FLOPs and 83.5% accuracy [28].
Table 1. Overview of some V-MLPs. For each V-MLP, we present the accuracy of the largest reported model that is trained on IM-1K only.
It is important to note that the high-level structure of V-MLPs is not new. Depthwise-separable convolutions, for example, as used in MobileNets [6], consist of a depthwise convolution (spatial mixer) and a pointwise 1×1 convolution (channel mixer). Furthermore, the standard transformer block comprises a self-attention layer (spatial mixer) and a pointwise MLP (channel mixer). This suggests that the good performance and accuracy obtained with these models result at least partly from the high-level structure of layers used inside V-MLPs and related models. Specifically, this structure comprises (1) the use of non-overlapping spatial patch embeddings as inputs, (2) some combination of independent spatial (with large enough spatial kernels) and channel processing, (3) some interleaved normalization, and (4) residual connections. Recently, such a block structure has been dubbed “MetaFormer” ([24], Figure 2), referring to the high-level structure of the block, rather than the particular implementation of its subcomponents. Some evidence for this hypothesis comes from [27], who used a simple isotropic purely convolutional model, called “ConvMixer,” that takes non-overlapping patch embeddings as inputs. Given an equal parameter budget, their model shows improved accuracy compared to standard ResNets and DeiT. A more thorough analysis of this hypothesis was performed in “A ConvNet for the 2020s” [28], which systematically examined the impact of block elements 1-4 and arrived at a purely convolutional model that reaches SOTA performance on ImageNet, even when trained on IM-1K alone.
Figure 2. a. V-MLP, b. Transformer and c. MetaFormer. Adapted from .
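The split into a spatial mixer and a channel mixer also explains part of the efficiency of depthwise-separable convolutions. The sketch below compares weight counts (biases ignored) for a standard convolution and its depthwise-separable counterpart. The kernel size and channel width are arbitrary example values, not figures from the cited papers.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (spatial and channel mixing fused)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k conv followed by a pointwise 1x1 conv."""
    depthwise = k * k * c_in   # spatial mixer: one k x k filter per channel
    pointwise = c_in * c_out   # channel mixer: 1x1 convolution
    return depthwise + pointwise


k, channels = 3, 256
std = standard_conv_params(k, channels, channels)
sep = depthwise_separable_params(k, channels, channels)
print(std, sep, round(std / sep, 2))   # 589824 67840 8.69
```

For this example the separated form needs roughly 8.7 times fewer weights, and almost all of the remaining weights sit in the channel mixer.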
Taken together, these studies suggest that what matters for efficient and accurate vision models are the particular layer ingredients found in the MetaFormer block (tokenization, independent spatial and channel processing, normalization and residual connections) and the inductive biases typically found in CNNs (local processing with weight sharing and a hierarchical network structure). Clearly, this conclusion does not imply a special role for MLPs, as the MetaFormer structure building on purely convolutional layers works (almost) just as well.
So are there other reasons for the recent focus on V-MLPs? The above-mentioned convolutional MetaFormers were all tested on vision tasks, and it is well known that the convolutional structure matches well with natural image statistics. Indeed, as mentioned above, the best-performing V-MLPs and ViTs (re-)introduce the inductive biases, such as local hierarchical processing, typically found in CNNs. However, if one is interested in a generic model that performs well in multimodal tasks and has lower computational complexity than standard transformers, an MLP-based network can be a good choice. For example, some initial results show that MLP-based MetaFormers also perform well on NLP tasks [18, 29].
An additional benefit of isotropic MLPs is that they scale more easily. This scalability can make it easier to implement them on compute infrastructure that relies on regular compute patterns. Furthermore, it facilitates capturing the high information content of large (multimodal) datasets.
So, based on current findings, we can formulate the following practical guidelines: for settings that are significantly resource- and data-constrained, such as edge computing, there is currently little evidence that V-MLPs, like ViTs, are a superior alternative to CNNs. However, when datasets are large and/or multimodal, and compute is more abundant, pure MLP-based models may be a more efficient and generic choice compared to CNNs and transformer-based models that rely on self-attention.
We are still in the early days of examining the possibilities of MLP-based models. In just 9 months the accuracy of V-MLPs on ImageNet classification increased by a stunning ~8%. It is expected that these models will improve further and that hybrid networks, which properly combine MLPs, CNNs and attention mechanisms, have the potential to significantly outperform existing models (e.g. ). We are excited to be part of this future.
[1] Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton: “ImageNet Classification with Deep Convolutional Neural Networks,” Advances in Neural Information Processing Systems 25 (2012): 1097-1105.
[2] Karen Simonyan, Andrew Zisserman: “Very Deep Convolutional Networks for Large-Scale Image Recognition,” 2014; arXiv:1409.1556, http://arxiv.org/abs/1409.1556.
[3] Min Lin, Qiang Chen, Shuicheng Yan: “Network In Network,” 2013; arXiv:1312.4400, http://arxiv.org/abs/1312.4400.
[4] Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, Kurt Keutzer: “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size,” 2016; arXiv:1602.07360, http://arxiv.org/abs/1602.07360.
[5] Kaiming He, et al.: “Deep Residual Learning for Image Recognition,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[6] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam: “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications,” 2017; arXiv:1704.04861, http://arxiv.org/abs/1704.04861.
[7] Mingxing Tan, Quoc V. Le: “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks,” International Conference on Machine Learning, 2019; arXiv:1905.11946, http://arxiv.org/abs/1905.11946.
[8] Ross Wightman, Hugo Touvron, Hervé Jégou: “ResNet strikes back: An improved training procedure in timm,” 2021; arXiv:2110.00476, http://arxiv.org/abs/2110.00476.
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova: “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” 2018; arXiv:1810.04805, http://arxiv.org/abs/1810.04805.
[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby: “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale,” 2020; arXiv:2010.11929, http://arxiv.org/abs/2010.11929.
[11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei: “ImageNet: A Large-Scale Hierarchical Image Database,” 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
[12] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou: “Training data-efficient image transformers & distillation through attention,” 2020; arXiv:2012.12877, http://arxiv.org/abs/2012.12877.
[13] Tete Xiao, Mannat Singh, Eric Mintun, Trevor Darrell, Piotr Dollár, Ross Girshick: “Early Convolutions Help Transformers See Better,” 2021; arXiv:2106.14881, http://arxiv.org/abs/2106.14881.
[14] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo: “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows,” 2021; arXiv:2103.14030, http://arxiv.org/abs/2103.14030.
[15] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao: “Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions,” 2021; arXiv:2102.12122, http://arxiv.org/abs/2102.12122.
[16] Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, Joao Carreira: “Perceiver: General Perception with Iterative Attention,” 2021; arXiv:2103.03206, http://arxiv.org/abs/2103.03206.
[17] Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy: “MLP-Mixer: An all-MLP Architecture for Vision,” 2021; arXiv:2105.01601, http://arxiv.org/abs/2105.01601.
[18] Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Gautier Izacard, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, Hervé Jégou: “ResMLP: Feedforward networks for image classification with data-efficient training,” 2021; arXiv:2105.03404, http://arxiv.org/abs/2105.03404.
[19] Zizhao Zhang, Han Zhang, Long Zhao, Ting Chen, Sercan O. Arik, Tomas Pfister: “Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding,” 2021; arXiv:2105.12723, http://arxiv.org/abs/2105.12723.
[20] Jianyuan Guo, Yehui Tang, Kai Han, Xinghao Chen, Han Wu, Chao Xu, Chang Xu, Yunhe Wang: “Hire-MLP: Vision MLP via Hierarchical Rearrangement,” 2021; arXiv:2108.13341, http://arxiv.org/abs/2108.13341.
[21] Tan Yu, Xu Li, Yunfeng Cai, Mingming Sun, Ping Li: “S^2-MLPv2: Improved Spatial-Shift MLP Architecture for Vision,” 2021; arXiv:2108.01072, http://arxiv.org/abs/2108.01072.
[22] Jiachen Li, Ali Hassani, Steven Walton, Humphrey Shi: “ConvMLP: Hierarchical Convolutional MLPs for Vision,” 2021; arXiv:2109.04454, http://arxiv.org/abs/2109.04454.
[23] Yuki Tatsunami, Masato Taki: “RaftMLP: How Much Can Be Done Without Attention and with Less Spatial Locality?” 2021; arXiv:2108.04384, http://arxiv.org/abs/2108.04384.
[24] Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, Shuicheng Yan: “MetaFormer is Actually What You Need for Vision,” 2021; arXiv:2111.11418, http://arxiv.org/abs/2111.11418.
[25] Yehui Tang, Kai Han, Jianyuan Guo, Chang Xu, Yanxi Li, Chao Xu, Yunhe Wang: “An Image Patch is a Wave: Quantum Inspired Vision MLP,” 2021; arXiv:2111.12294, http://arxiv.org/abs/2111.12294.
[26] Ziyu Wang, Wenhao Jiang, Yiming Zhu, Li Yuan, Yibing Song, Wei Liu: “DynaMixer: A Vision MLP Architecture with Dynamic Mixing,” 2022; arXiv:2201.12083, http://arxiv.org/abs/2201.12083.
[27] Asher Trockman, J. Zico Kolter: “Patches Are All You Need?” 2022; arXiv:2201.09792, http://arxiv.org/abs/2201.09792.
[28] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie: “A ConvNet for the 2020s,” 2022; arXiv:2201.03545, http://arxiv.org/abs/2201.03545.
[29] Hanxiao Liu, Zihang Dai, David R. So, Quoc V. Le: “Pay Attention to MLPs,” 2021; arXiv:2105.08050, http://arxiv.org/abs/2105.08050.
[30] Huangjie Zheng, Pengcheng He, Weizhu Chen, Mingyuan Zhou: “Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs,” 2022; arXiv:2202.06510, http://arxiv.org/abs/2202.06510.