Transformers in Computer Vision

by Bert Moons –  System Architect at AXELERA AI

Summary: Convolutional Neural Networks (CNN) have been dominant in Computer Vision applications for over a decade. Today, they are being outperformed and replaced by Vision Transformers (ViT) with a higher learning capacity. The fastest ViTs are essentially a CNN/Transformer hybrid, combining the best of both worlds: (A) CNN-inspired hierarchical and pyramidal feature maps, where embedding dimensions increase and spatial dimensions decrease throughout the network, are combined with local receptive fields to reduce model complexity, while (B) Transformer-inspired self-attention increases modeling capacity and leads to higher accuracies. Even though ViTs outperform CNNs in specific cases, their dominance has not yet been asserted. We illustrate and conclude that SotA CNNs are still on par with, or better than, ViTs in ImageNet validation, especially (1) when trained from scratch without distillation, (2) in the lower-accuracy <80% regime, and (3) for lower network complexities optimized for Edge devices.

Convolutional Neural Networks

Convolutional Neural Networks (CNN) have been the dominant Neural Network architectures in Computer Vision for almost a decade, after the breakthrough performance of AlexNet[1] on the ImageNet[2] image classification challenge. From this baseline architecture, CNNs have evolved into variations of bottlenecked architectures with residual connections, such as ResNet[3] and RegNet[4], or into more lightweight networks optimized for mobile contexts using grouped convolutions and inverted bottlenecks, such as MobileNet[5] or EfficientNet[6]. Typically, such networks are benchmarked and compared by training them on small images from the ImageNet data set. After this pretraining, they can be used for applications outside of image classification such as object detection, panoptic vision, semantic segmentation, or other specialized tasks. This is done by using them as a backbone in an end-to-end application-specific Neural Network and finetuning the resulting network on the appropriate data set and application.

A typical ResNet-style CNN is shown in Figure 1-1 and Figure 1-4 (a). Such networks typically share several features:

  1. They interleave or stack 1×1 and k×k convolutions to balance the cost of convolutions with building a large receptive field.
  2. Training is stabilized by using batch-normalization and residual connections.
  3. Feature maps are built hierarchically by gradually reducing the spatial dimensions (W,H), finally downscaling them by a factor of 32x.
  4. Feature maps are built pyramidally, by increasing the embedding dimensions of the layers from the order of tens of channels in the first layers to thousands in the last.
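
The hierarchical and pyramidal structure in points 3 and 4 can be made concrete with a small shape trace. The sketch below is only an illustration: the helper name `trace_feature_maps` is hypothetical, and the 224×224 input, the stride-4 stem and the stage widths (64, 128, 256, 512) are assumed from the commonly published ResNet-34 configuration rather than taken from this article.

```python
# Hypothetical helper: trace feature-map shapes through a ResNet-34-style backbone.
# Input size, stem stride and stage widths are assumptions (common ResNet-34 settings).

def trace_feature_maps(input_size=224, stem_stride=4,
                       stage_channels=(64, 128, 256, 512),
                       stage_strides=(1, 2, 2, 2)):
    """Return a list of (stage_name, channels, spatial_size) tuples."""
    # The stem (stride-2 convolution followed by stride-2 pooling) downscales by 4x.
    size = input_size // stem_stride
    shapes = []
    for i, (channels, stride) in enumerate(zip(stage_channels, stage_strides), 1):
        size //= stride  # each later stage halves the spatial dimensions
        shapes.append((f"stage{i}", channels, size))
    return shapes


shapes = trace_feature_maps()
for name, channels, size in shapes:
    print(f"{name}: {channels} channels, {size}x{size} spatial")

# Ratio between the input resolution and the final feature-map resolution.
total_downscale = 224 // shapes[-1][2]
print("total downscale factor:", total_downscale)
```

Under these assumptions the spatial size shrinks from 56×56 to 7×7 while the channel count grows from 64 to 512, which is the 32x downscaling described above.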

Figure 1-1: Illustration of ResNet34 [3]

Within these broader families of backbone networks, researchers have developed a set of techniques known as Neural Architecture Search (NAS)[7] to optimize the exact parametrizations of these networks. Hardware-Aware NAS methods automatically optimize a network’s latency while maximizing accuracy, by efficiently searching over its architectural parameters such as the number of layers, the number of channels within each layer, kernel sizes, activation functions and so on. So far, due to high training costs, these methods have failed to invent radically new architectures for Computer Vision. They mostly generate networks within the ResNet/MobileNet hybrid families, leading to only modest improvements of 10-20% over their hand-designed baseline[8].

Transformers in Computer Vision

A more radical evolution in Neural Networks for Computer Vision is the move towards using Vision Transformers (ViT)[9] as a CNN-backbone replacement. Inspired by the astounding performance of Transformer models in Natural Language Processing (NLP)[10], research has moved towards applying the same principles in Computer Vision. Notable examples, among many others, are XCiT[11], PiT[12], DeiT[13] and Swin Transformers[14]. Here, analogously to NLP, images are essentially treated as sequences of image patches, by modeling feature maps as vectors of tokens, each token representing an embedding of a specific image patch.
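
To illustrate how an image becomes a sequence of tokens, the following minimal sketch splits an image into non-overlapping patches and flattens each patch into one vector. The function name `patchify`, the 224×224 RGB input and the 16×16 patch size are assumptions chosen for illustration. In an actual ViT, a learned linear projection would then map each flattened patch to its embedding, which is omitted here.

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into a sequence of flattened patches.

    Each patch becomes one token of length patch_size * patch_size * C.
    """
    height, width = len(image), len(image[0])
    assert height % patch_size == 0 and width % patch_size == 0
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            token = []
            for y in range(top, top + patch_size):
                for x in range(left, left + patch_size):
                    token.extend(image[y][x])  # append all channels of this pixel
            tokens.append(token)
    return tokens


# Toy 224x224 RGB image filled with zeros (illustrative input only).
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
tokens = patchify(image, patch_size=16)
num_tokens, token_dim = len(tokens), len(tokens[0])
print(num_tokens, "tokens of dimension", token_dim)
```

With these assumed sizes the image yields a 14×14 grid of patches, so the Transformer processes a sequence of 196 tokens.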

Figure 1-2: Illustration of the original basic vision transformer (ViT), taken from [9]



Figure 1-3: Illustration of a self-attention module. K, Q and V are linear projections of the same input feature map. The attention map is a softmax function of the matrix product QK^T. Image taken from [15].

An illustration of a basic ViT is given in Figure 1-2. The ViT is a sequence of stacked MLPs and self-attention layers, with or without residual connections. This ViT uses the multi-headed self-attention mechanism developed for NLP Transformers, see Figure 1-3. Such a self-attention layer has two distinguishing features. It (1) can 'guide' its attention by dynamically reweighting the importance of specific features depending on the context and (2) has a full receptive field if global self-attention is used. The latter is the case when self-attention is applied across all possible input tokens. Here all tokens, representing embeddings related to specific spatial image patches, are correlated with each other, giving a full receptive field. Global self-attention is typical in ViTs, but not a requirement. Self-attention can also be made local, by limiting the scope of the self-attention module to a smaller set of tokens, in turn reducing the operation's receptive field at a particular stage.
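
The mechanism described above can be sketched in a few lines of plain Python. This is a simplified single-head version under stated assumptions, not the exact layer of any particular ViT: the helper names are hypothetical, the 1/sqrt(d) scaling follows the standard Transformer formulation (the caption of Figure 1-3 omits it), and the `window` argument implements a simple one-dimensional locality mask, whereas local-attention ViTs typically use two-dimensional windows.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v, window=None):
    """Single-head self-attention over a token sequence x of shape N x D.

    window=None -> global attention: every token attends to every token.
    window=r    -> local attention: token i only attends to tokens j with |i - j| <= r.
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = matmul(q, [list(col) for col in zip(*k)])  # Q K^T, shape N x N
    attn = []
    for i, row in enumerate(scores):
        masked = [s * scale if window is None or abs(i - j) <= window else float("-inf")
                  for j, s in enumerate(row)]
        attn.append(softmax(masked))  # attention weights depend on the input itself
    return matmul(attn, v), attn


# Four toy tokens of dimension 2; identity projections keep the example readable.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out_global, attn_global = self_attention(x, identity, identity, identity)
out_local, attn_local = self_attention(x, identity, identity, identity, window=1)
```

In the global case every row of the attention map has non-zero weight on all four tokens, while with `window=1` the first token puts zero weight on the third and fourth tokens, which is the reduced receptive field mentioned above.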

This ViT architecture contrasts strongly with CNNs. In vanilla CNNs without attention mechanisms, (1) features are statically weighted using pretrained weights, rather than dynamically reweighted based on the context as in ViTs, and (2) receptive fields of individual network layers are typically local and limited by the convolutional kernel size.
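
As a rough illustration of point (2), the receptive field of a stack of convolutions can be computed with the usual recurrence: each layer adds (kernel size − 1) times the accumulated stride of the layers before it. The function name `receptive_field` and the example layer stacks are assumptions made for this sketch.

```python
def receptive_field(layers):
    """Receptive field (in input pixels, along one axis) of one output unit.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between neighbouring units
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


single = receptive_field([(3, 1)])                    # one 3x3 convolution
stacked = receptive_field([(3, 1)] * 3)               # three stacked 3x3 convolutions
strided = receptive_field([(3, 2), (3, 2), (3, 1)])   # with two stride-2 layers
print(single, stacked, strided)
```

A single 3×3 layer only sees 3 pixels along each axis, and the receptive field grows gradually with depth and striding. A global self-attention layer, by contrast, covers all tokens in one step.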

Part of the success of CNNs is the strong architectural inductive bias implied by the convolutional approach. Convolutions with shared weights explicitly encode how specific identical patterns are repeated in images. This inductive bias ensures easy training convergence on relatively small datasets, but also limits the modeling capacity of CNNs. Vision Transformers do not enforce such strict inductive biases. This makes them harder to train, but also increases their learning capacity, see Figure 1-5. To achieve good results using ViTs in Computer Vision, these networks are often trained using knowledge distillation with a large CNN-based teacher (as in DeiT[16], for example). This way, part of the inductive bias of CNNs can be softly imposed during the training process.
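
A minimal sketch of such a distillation objective is given below. It uses the generic soft-distillation formulation (a cross-entropy term on the true label plus a temperature-scaled KL term towards the teacher's output distribution); the function names and the values of `alpha` and `temperature` are assumptions for illustration and do not reproduce the exact training recipe of any specific ViT.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with an optional temperature that softens the distribution."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=3.0):
    """(1 - alpha) * CE(student, label) + alpha * T^2 * KL(teacher_T || student_T)."""
    # Standard cross-entropy of the student against the ground-truth label.
    cross_entropy = -math.log(softmax(student_logits)[label])
    # KL divergence between the softened teacher and student distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    return (1 - alpha) * cross_entropy + alpha * temperature ** 2 * kl


student = [2.0, 0.5, -1.0]
agreeing = distillation_loss(student, [2.0, 0.5, -1.0], label=0)
disagreeing = distillation_loss(student, [-1.0, 0.5, 2.0], label=0)
print(agreeing, disagreeing)
```

The loss is larger when the student's predictions diverge from the teacher's, so the student is pulled towards the behaviour of the CNN teacher in addition to fitting the labels.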

Initially, ViTs were directly inspired by NLP Transformers: massive models with a uniform topology and global self-attention, see Figure 1-4 (b). Recent ViTs have a macro-architecture that is closer to that of CNNs (Figure 1-4 (a)), using hierarchical pyramidal feature maps (as in PiT[12]; see Figure 1-4 (c)) and local self-attention (as in Swin Transformers[14]). A high-level overview of this evolution is given in Table 1.
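
The benefit of local self-attention can be estimated by counting the entries of the attention maps. The sketch below compares one global attention map with non-overlapping windowed attention; the function name is hypothetical, and the 56×56 token grid and 7×7 window are assumed values chosen for illustration.

```python
def attention_score_entries(height, width, window=None):
    """Number of attention-map entries for a height x width grid of tokens.

    window=None -> one global N x N map over all N = height * width tokens.
    window=w    -> non-overlapping w x w windows, each with its own (w*w) x (w*w) map.
    """
    num_tokens = height * width
    if window is None:
        return num_tokens ** 2
    assert height % window == 0 and width % window == 0
    num_windows = (height // window) * (width // window)
    return num_windows * (window * window) ** 2


global_entries = attention_score_entries(56, 56)
local_entries = attention_score_entries(56, 56, window=7)
print(global_entries, local_entries, global_entries // local_entries)
```

Global attention scales quadratically with the number of tokens, while windowed attention scales linearly for a fixed window size. For this assumed grid the local variant needs 64 times fewer attention-map entries.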


Figure 1-4: Comparing the dimension configurations of (a) ResNet-50, a classical CNN with pyramidal feature maps, (b) an early ViT-S/16 [9] with a uniform macro-architecture and (c) a modern PiT-S [12] with CNN-ified pyramidal feature maps. Figure taken from [12].


Table 1: Comparing early ViTs, recent ViTs and modern CNNs


Comparing CNNs and ViTs for Edge Computing

Even though ViTs have shown State-of-the-Art (SotA) performance in many Computer Vision tasks, they do not necessarily outperform CNNs across the board. This is illustrated in Figure 1-5 and Figure 1-6. These figures compare the performance of ViTs and CNNs in terms of ImageNet validation accuracy versus model size and complexity, for various training regimes. It is important to distinguish between these training regimes, as not all training methodologies are feasible for specific downstream tasks. First, for some applications only relatively small datasets are available. In that case, CNNs typically perform better. Second, many ViTs rely on distillation approaches to achieve high performance. For that to work, they need a highly accurate pretrained CNN as a teacher, which is not always available.

Figure 1-5 (a) illustrates how CNNs and ViTs compare in terms of model size versus accuracy if all types of training are allowed, including distillation approaches and using additional data (such as JFT-300M[17]). Here ViTs perform on par with or better than large-scale CNNs, outperforming them in specific ranges. Notably, XCiT[11] models perform particularly well in the range around 3M parameters. However, when neither distillation nor training on extra data is allowed, the difference is less pronounced, see Figure 1-5 (b). In both figures, EfficientNet-B0 and ResNet-50 are indicated as references for context.


Figure 1-5: Comparing CNNs to ViTs in terms of model size (# Params) and ImageNet Top-1 Validation accuracy. (a) Shows data for all types of training: (i) training on ImageNet1k training data, (ii) using extra data such as ImageNet21k or JFT [17] and (iii) training using knowledge distillation using a CNN teacher. (b) Shows data for a subset of networks that are trained from scratch, without CNN-based knowledge distillation, but with state-of-the-art training techniques on ImageNet. Figure (b) illustrates the lasting competitiveness of CNNs with ViTs, especially in the Edge domain for models with fewer than 25M parameters, where performance is very similar between CNNs and ViTs. ResNet-50 and EfficientNet-B0 are given as reference points. Data is taken from [18] and the respective scientific papers.


Figure 1-6 illustrates the same in terms of accuracy versus model complexity for a more limited set of known networks. Figure 1-6 (a) and (b) show that CNNs are mostly dominant for lower accuracies and networks with lower complexity (<1B FLOPs) for all types of training. This holds even for CNN-ified Vision Transformers such as PiT[12], which use a hierarchical architecture with pyramidal feature maps, and for Swin Transformers, which optimize complexity by using local self-attention. Without extra data or distillation, CNNs typically outperform ViTs across the board, especially for networks with a lower complexity or for networks with accuracies lower than 80%. For example, at a similar complexity, both RegNets and EfficientNet-style networks significantly outperform XCiT ViTs, see Figure 1-6 (b).


Figure 1-6: Comparing SotA CNNs to ViTs in terms of computational cost (# FLOPs) and ImageNet Top-1 Validation Accuracy. (a) Shows data for all types of training: (i) training on ImageNet1k training data, (ii) using extra data such as ImageNet21k or JFT [17] and (iii) training using knowledge distillation using a CNN teacher. (b) Shows data for a subset of networks that are trained from scratch, without extra data or knowledge distillation, but with state-of-the-art training techniques on ImageNet. Figure (b) illustrates how CNNs are still dominant in the <80% accuracy regime. Even CNN-ified modern ViTs with hierarchical pyramidal models such as PiT [12] do not outperform EfficientNet [6] and RegNet [4] style CNNs. In the 80%+ range, networks with local self-attention such as Swin [14] are on par with or better than RegNets [4]. Data is taken from [16] and the respective scientific papers.


Apart from the high-level differences in Table 1 and the performance differences in this section, bringing ViTs to edge devices poses some additional requirements. Compared to CNNs, ViTs rely much more on three specific operations that must be properly accelerated on-chip. First, ViTs rely on accelerated softmax operators as part of self-attention, while CNNs only require softmax as the final layer in a classification network. Second, ViTs typically use smooth nonlinear activation functions such as GELU, while CNNs mostly rely on Rectified Linear Units (ReLU), which are much cheaper to execute and accelerate. Finally, ViTs typically require LayerNorm, a normalization layer that stabilizes training and computes its mean and standard deviation dynamically, also during inference. CNNs, however, typically use batch-normalization, whose statistics only need to be computed during training and which can essentially be ignored in inference by folding the operation into neighbouring convolutional layers.
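
The difference between the two normalization schemes can be sketched as follows. The example folds batch-norm statistics into a preceding linear layer (standing in for a 1×1 convolution) and contrasts this with LayerNorm, which has to compute its statistics from each input. All function names and numeric values are assumptions for illustration, and the LayerNorm shown omits the learnable scale and shift for brevity.

```python
import math


def fold_batchnorm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold per-output-channel batch-norm parameters into a preceding linear layer.

    weights: [out_channels][in_channels]; all other arguments: one value per output channel.
    """
    folded_w, folded_b = [], []
    for w_row, b, g, bt, mu, v in zip(weights, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)  # fixed after training, so it can be precomputed
        folded_w.append([w * scale for w in w_row])
        folded_b.append((b - mu) * scale + bt)
    return folded_w, folded_b


def linear(x, weights, bias):
    """Linear layer; equivalent to a 1x1 convolution applied at a single pixel."""
    return [sum(w * xi for w, xi in zip(w_row, x)) + b for w_row, b in zip(weights, bias)]


def batchnorm_inference(y, gamma, beta, mean, var, eps=1e-5):
    """Batch-norm at inference time, using stored running statistics."""
    return [g * (yi - mu) / math.sqrt(v + eps) + bt
            for yi, g, bt, mu, v in zip(y, gamma, beta, mean, var)]


def layernorm(x, eps=1e-5):
    """LayerNorm: mean and variance are computed from the input at inference time."""
    mu = sum(x) / len(x)
    var = sum((xi - mu) ** 2 for xi in x) / len(x)
    return [(xi - mu) / math.sqrt(var + eps) for xi in x]


# Illustrative parameters for a layer with 2 inputs and 2 outputs.
weights, bias = [[0.5, -1.0], [2.0, 0.25]], [0.1, -0.2]
gamma, beta, mean, var = [1.5, 0.8], [0.0, 0.3], [0.2, -0.1], [0.9, 1.6]
x = [1.0, 2.0]

reference = batchnorm_inference(linear(x, weights, bias), gamma, beta, mean, var)
folded_w, folded_b = fold_batchnorm(weights, bias, gamma, beta, mean, var)
folded = linear(x, folded_w, folded_b)  # one layer, no separate normalization step
normalized = layernorm(x)
```

After folding, the batch-norm layer disappears from the inference graph entirely, whereas the LayerNorm reduction over each input remains and must be executed by the hardware.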



Vision Transformers are rapidly starting to dominate many applications in Computer Vision. Compared to CNNs, they achieve higher accuracies on large data sets due to their higher modeling capacity and lower inductive biases as well as their global receptive fields. Modern, improved and smaller ViTs such as PiT and Swin are essentially becoming CNN-ified, by reducing receptive fields and using hierarchical pyramidal feature maps. However, CNNs are still on par with or better than SotA ViTs on ImageNet in terms of model complexity or size versus accuracy, especially when trained without knowledge distillation or extra data and when targeting lower accuracies.

Stay tuned to learn more about our progress in upcoming blog posts, and be sure to subscribe to our newsletter using the form on our homepage!


[1] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classification with deep convolutional neural networks.” Advances in neural information processing systems 25 (2012): 1097-1105.

[2] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition (pp. 248–255).

[3] He, Kaiming, et al. “Deep residual learning for image recognition.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

[4] Radosavovic, Ilija, et al. “Designing network design spaces.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.

[5] Howard, Andrew, et al. “Searching for mobilenetv3.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.

[6] Tan, Mingxing, and Quoc Le. “Efficientnet: Rethinking model scaling for convolutional neural networks.” International Conference on Machine Learning. PMLR, 2019.

[7] He, Xin, Kaiyong Zhao, and Xiaowen Chu. “AutoML: A Survey of the State-of-the-Art.” Knowledge-Based Systems 212 (2021): 106622.

[8] Moons, Bert, et al. “Distilling optimal neural networks: Rapid search in diverse spaces.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.

[9] Dosovitskiy, Alexey, et al. “An image is worth 16×16 words: Transformers for image recognition at scale.” arXiv preprint arXiv:2010.11929 (2020).

[10] Devlin, Jacob, et al. “Bert: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018).

[11] El-Nouby, Alaaeldin, et al. “XCiT: Cross-Covariance Image Transformers.” arXiv preprint arXiv:2106.09681 (2021).

[12] Heo, Byeongho, et al. “Rethinking spatial dimensions of vision transformers.” arXiv preprint arXiv:2103.16302 (2021).

[13] Touvron, Hugo, et al. “Training data-efficient image transformers & distillation through attention.” International Conference on Machine Learning. PMLR, 2021.

[14] Liu, Ze, et al. “Swin transformer: Hierarchical vision transformer using shifted windows.” arXiv preprint arXiv:2103.14030 (2021).

[15] Li, Yawei, et al. “Spatio-Temporal Gated Transformers for Efficient Video Processing.” NeurIPS ML4AD Workshop, 2021.

[16] Touvron, Hugo, et al. “Training data-efficient image transformers & distillation through attention.” International Conference on Machine Learning. PMLR, 2021.

[17] Sun, Chen, et al. “Revisiting unreasonable effectiveness of data in deep learning era.” Proceedings of the IEEE international conference on computer vision. 2017.

[18] Ross Wightman, “PyTorch Image Models”, seen on January 10, 2022.
