
Introducing Axelera AI’s New Advisor, Andreas Hansson

by Fabrizio Del Maffeo – CEO at AXELERA AI

Andreas Hansson joined Axelera AI as an advisor last month. Andreas is an angel investor in several start-ups and serves on the board of several public companies. He will advise us on technology, market trends, and computing and artificial intelligence investment opportunities. To commemorate Andreas joining our team, we hosted a short interview to learn more about his background.

 

Andreas, thank you for joining us today. Before we jump into your career and accomplishments, can you tell us a bit more about you personally?

As a kid, I was always encouraged to be curious and inquisitive, and it has been a constant theme throughout my life. I spent much of my childhood taking things apart and making new things. I think it was this curiosity that sparked my interest in technology from a young age. I loved learning how something worked, building on it, or creating something different. To a large extent, it’s still what I love doing most today.

 

That curiosity has taken you to many great places. Why did you move from research into investment?

Thank you. Research is hugely exciting, and I enjoy the thrill of expanding my horizons with new technologies and innovations. Sometimes it can get detached from reality, though – it’s possible to get too focused on technology for technology’s sake. Getting more involved in the business decisions guiding the research and M&A activities grounded me in the purpose of all that research. I started to see that investment is a natural progression, and I love that it allows me to dive into all the aspects of a business. It’s a great place to be for a full-circle view.

 

You worked for two worldwide leaders in two completely different fields: first Arm, the biggest IP company in the world, and then SoftBank, the largest VC in the world. What are the most important lessons you learned in these two experiences?

Arm taught me the value of partnership. The company’s astonishing success comes from, and still relies on, trust within the ecosystem. That trust and partnered work permeate the whole organisation. As a result, Arm is very collaborative, both internally and externally, and for me, it was a fantastic learning platform with tons of support.

One of my key takeaways from SoftBank was the power of thinking big and asking, “what if…?” It lit up the same inquisitive nature I had as a child. In some of my previous roles in engineering, I found myself getting a little too pragmatic and level-headed – important in some cases but stunting in others. Within SoftBank and the Vision Fund, I was surrounded by people pushing the envelope and truly thinking outside the box.

 

More and more startups are trying to enter the computing and AI semiconductor markets, proposing new architectures which always claim to be way more efficient and powerful than the incumbents. What is your opinion about this? Is there any secret sauce to succeed in this market?

Computing is permeating everything in our lives and is ever-evolving to deliver the right power/performance trade-off for each use case. For the same reasons, we are also seeing more changes in how computing systems are built, with novel architectures, technologies, manufacturing methods, etc. These developments present fantastic opportunities for startups to innovate and show what is possible. I actually think there are not enough semiconductor startups and also not enough semiconductor-focused DeepTech VCs.

 

After years of large investment rounds, it seems like the venture capital market is undergoing a correction. What is your opinion about this? What is the outlook for the coming 24 months?

VC activity is merely reflecting what’s happening in the markets broadly. I’m not surprised that priorities are shifting as everyone is working out what the world will look like going forward. While it will likely be a more challenging environment, and valuation expectations will come down, the next 24 months should ultimately present good investment opportunities for VCs.

 

What do you suggest early-stage startups do when raising money in these uncertain times?

The best thing startups can do is stay on top of their spending. If possible, secure 18-24 months of runway. Consider prioritising profitability instead of growth, and at the very least, work out a route to positive unit economics.

 

You recently departed SoftBank for a great new adventure – what is it?

Yes! I’m launching 2Q Ventures, a dedicated quantum computing fund in partnership with my stellar team. While I enjoy late-stage investment and my public-company board work, I’ve stayed really passionate about frontier technology, and helping visionaries transform the world. 2Q Ventures gives me a framework for doing exactly that while accelerating development and building up an ecosystem.

 

Quantum computing is an exciting field. When do you think quantum computing technology will become accessible to enterprises? Which market sector do you expect will be an early adopter of commercial quantum computing?

Excitingly, enterprises can already access quantum computers on the cloud through services like Amazon Braket. However, due to limited scale and relatively high noise rates, quantum computers don’t have a commercial advantage yet. That said, the progress is incredible. We also see signs of a virtuous circle, similar to machine learning in the mid 2000s, with technology progress leading to more investment, which in turn is getting more people involved, helping broaden the talent pool and seed new startups in the field, which in turn accelerates the next generation of achievements. It’s the perfect recipe for acceleration over the next few years. I wouldn’t be surprised if we see a true commercial advantage in areas like quantum simulation in the same time frame.

 
Stay tuned to learn more about our progress in upcoming blog posts, and be sure to subscribe to our newsletter using the form on our homepage!


Ten questions with Axelera AI’s Scientific Advisor Luca Benini

by Fabrizio Del Maffeo – CEO at AXELERA AI

Professor Luca Benini is one of the foremost authorities on computer architecture, embedded systems, digital integrated circuits, and machine learning hardware. We’re honored to count him as one of our scientific advisors. Prof. Benini kindly agreed to answer a few questions for our followers on his research and the future of artificial intelligence.

For our readers who are unfamiliar with your work, can you give us a brief summary of your career?

I am the chair of Digital Circuits and Systems at ETHZ and a full professor at the Università di Bologna. I received my PhD from Stanford University and have been a visiting professor at Stanford University, IMEC, and EPFL. I also served as chief architect at STMicroelectronics France.

My research interests are in energy-efficient parallel computing systems, smart sensing micro-systems and machine learning hardware. I’ve published more than 1,000 peer-reviewed papers and five books.

I am a Fellow of the IEEE and the ACM, and a member of the Academia Europaea. I’m the recipient of the 2016 IEEE CAS Mac Van Valkenburg Award, the 2019 IEEE TCAD Donald O. Pederson Best Paper Award, and the 2020 ACM/IEEE A. Richard Newton Award.

Which research subjects are you exploring?

I am extremely interested in energy-efficient hardware for machine learning and data-intensive computing. More specifically, I am passionate about exploring the trade-off between efficiency and flexibility. While everybody is aware of the fact that you can enormously boost efficiency with super-specialization, a super-specialized architecture will be narrow and short-lived, so we need flexibility. 

Artificial Intelligence requires a new computing paradigm and new data-driven architectures with high parallelisation. Can you share with us what you think the most promising directions are and what kind of new applications they can unleash?

I believe that the most impactful innovations are those that improve efficiency without over-specialization. For instance, using low bit-width representations reduces energy, but you need to have “transprecision,” i.e., the capability to dynamically adjust numerical precision. Otherwise, you won’t be accurate enough on many inference/training tasks, and then your scope of application may narrow down too much.

Another high-impact direction is related to minimising switching activity across the board. For instance, systolic arrays are very scalable (local communication patterns) but have huge switching activity related to local register storage. In-memory computing cores can do better than systolic arrays, but they are not a panacea. In general, we need to design architectures where we reduce the cost related to moving data in time and space. 

 

Can you share more with us about the tradeoffs and benefits of analog computing versus digital computing and where they can work together?

Analog computing is a niche, but a very important one. Ultimately, we can implement multiply-accumulate arrays very efficiently with analog computation, possibly beating digital logic, but it’s a tough fight. You need to do everything right (from interface and core computation circuits to precision selection to size). 

The critical point is to design the analog computing arrays in a way that can be easily ported to different technology targets without complete manual redesign. I view an analog computing core as a large-scale “special function unit” that needs to be efficiently interfaced with a digital architecture. So, it’s a “digital on top” design, with some key analog cores, that can win. 

Our sector has a prevailing opinion that Moore’s Law is dead. Do you agree, and how can we increase computing density?

The “traditional” Moore’s Law is dead, but scaling is fully alive and kicking through a number of different technologies — 2.5D, 3D die stacking, monolithic 3D, heterogeneous 3D, new electron devices, optical devices, quantum devices and more. This used to be called “More-than-Moore,” but I think it’s now really the cornerstone of scaling compute density – the ultimate goal. 

You are a very important contributor to the RISC-V community with your PULP platform, widely used in research and commercial applications. Why and when did you start the project, and how do you see it evolving in the next ten years?

I started PULP because I was convinced that the traditional closed-source computing IP market, and even more proprietary ISAs, were stifling innovation in many ways. I wanted to create a new innovation ecosystem where research could be more impactful and startups could more easily be created and succeed. I think I was right. Now the avalanche is in motion. I am sure that the open hardware and open ISA revolution will continue in the next ten years and change the business ecosystem, starting from more fragmented markets (e.g., IoT, Industrial) and then percolating to more consolidated markets (mobile, cloud). 

Can Europe play a leading role in the worldwide RISC-V community?

The EU can play a leading role. All the leading EU companies in the semiconductor business are actively exploring RISC-V, not just startups and academia. Of course, adoption will come in waves, but I think that some of the markets where the EU has strong leadership (automotive, IoT) are ripe for RISC-V solutions — as opposed to markets where the USA and Asia lead, such as mobile phones and servers which are much more consolidated. There is huge potential for the European industry in leveraging RISC-V.

What is the position of European universities and research centres versus American and Chinese in computing technologies – is there a gap, and how can the public sector help?

There is a gap, but it’s not in quality; it’s in quantity. The number of researchers in computer architecture, VLSI, and analog and digital circuits and systems in the EU is small compared to the USA and Asia. Unfortunately, these “demographic factors” take time to change. So really, the challenge is on academics to increase the throughput. Industry can play a role, too – for instance, leading companies can help found “innovation hubs” across Europe to increase our research footprint.

Companies can also help make Europe more attractive for jobs. Now that smart remote working is mainstream, people are not forced to move elsewhere. Good students in — for example — Italian or Spanish universities interested in semiconductors can find great jobs without moving. I am not saying that moving is bad, but if there are choices that do not imply moving away, more people will be attracted to these semiconductor companies and roles.

Is the European Chips Act powerful enough to change the trajectory of Europe within the global semiconductor ecosystem? 

It helps, but it’s not enough. There is no way to pump enough public money to make an EU behemoth at the scale of TSMC. But, if this money is well spent, it can “change the derivative” and create the conditions for much faster growth. 

Over the last decade, European semiconductor companies didn’t bring any cutting-edge computing technology to market. Is this changing, and do you think European startups can play a role in this change?

I think that some large EU companies are, by nature, “competitive followers,” so disruptive innovation is not their preferred approach, even though, of course, there are exceptions. The movement will come from startups, if they can attract the growth and funding of the larger companies. The emergence of a few European unicorns, as opposed to many small startups that just survive, will help Europe strengthen its position in the semiconductor market.

 
Stay tuned to learn more about our progress in upcoming blog posts, and be sure to subscribe to our newsletter using the form on our homepage!


Multilayer Perceptrons (MLPs) in Computer Vision

by Bram Verhoef – Algorithm Architect at AXELERA AI

Summary – Convolutional neural networks (CNNs) still dominate today’s computer vision. Recently, however, networks based on transformer blocks have also been applied to typical computer vision tasks such as object classification, detection, and segmentation, attaining state-of-the-art results on standard benchmark datasets.

However, these vision-transformers (ViTs) are usually pre-trained on extremely large datasets and may consist of billions of parameters, requiring teraflops of computing power. Furthermore, the self-attention mechanism inherent to classical transformers builds on quadratically complex computations.

To mitigate some of the problems posed by ViTs, a new type of network, based solely on multilayer perceptrons (MLPs), has recently been proposed. These vision MLPs (V-MLPs) forgo classical self-attention but still achieve global processing through their fully connected layers.

In this blog post, we review the V-MLP literature, compare V-MLPs to CNNs and ViTs, and attempt to extract the ingredients that really matter for efficient and accurate deep learning-based computer vision.

Introduction

In computer vision, CNNs have been the de facto standard networks for years. Early CNNs, like AlexNet [1] and VGGNet [2], consisted of a stack of convolutional layers, ultimately terminating in several large fully connected layers used for classification. Later, networks were made progressively more efficient by reducing the size of the classifying fully connected layers using global average pooling [3]. Furthermore, these more efficient networks, among other adjustments, reduce the spatial size of convolutional kernels [4, 5], employ bottleneck layers and depthwise convolutions [5, 6], and use compound scaling of the depth, width and resolution of the network [7]. These architectural improvements, together with several improved training methods [8] and larger datasets, have led to highly efficient and accurate CNNs for computer vision.

Despite their tremendous success, CNNs have their limitations. For example, their small kernels (e.g., 3×3) give rise to small receptive fields in the early layers of the network. This means that information processing in early convolutional layers is local and often insufficient to capture an object’s shape for classification, detection, segmentation, etc. This problem can be mitigated using deeper networks, increased strides, pooling layers, dilated convolutions, skip connections, etc., but these solutions either lose information or increase the  computational cost. Another limitation of CNNs stems from the inductive bias induced by the weight sharing across the spatial dimensions of the input. Such weight sharing is modeled after early sensory cortices in the brain and (hence) is well adapted to efficiently capture natural image statistics. However, it also limits the model’s capacity and restricts the tasks to which CNNs can be applied.
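To make the receptive-field point concrete, the small sketch below applies the standard receptive-field recursion to a few illustrative layer stacks; the layer configurations are assumptions for illustration, not taken from a specific network in this post.

```python
# A small worked example of the receptive-field limitation discussed above: each layer
# grows the receptive field by (kernel_size - 1) * jump, where "jump" is the distance
# (in input pixels) between adjacent outputs. Early stacks of 3x3 convs only "see" a
# tiny neighbourhood of the input.
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples; returns receptive field in input pixels."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # each layer adds (k - 1) steps at the current sampling jump
        jump *= s              # striding multiplies the distance between adjacent outputs
    return rf

print(receptive_field([(3, 1)] * 3))              # three stride-1 3x3 convs -> 7x7 pixels
print(receptive_field([(7, 2), (3, 2), (3, 1)]))  # a strided stem grows it faster -> 19x19
```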

Recently, there has been much research into solving the problems posed by CNNs by employing transformer blocks to encode and decode visual information. These so-called Vision Transformers (ViTs) are inspired by the success of transformer networks in Natural Language Processing (NLP) [9] and rely on global self-attention to encode global visual information in the early layers of the network. The original ViT was isotropic (it maintains an equal-resolution-and-size representation across layers), permutation invariant, based entirely on fully connected layers, and relied on global self-attention [10]. As such, the ViT solved the above-mentioned problems related to CNNs by providing larger (dynamic) receptive fields in a network with less inductive bias.

This was exciting research, but it soon became clear that the ViT was hard to train, not competitive with CNNs when trained on relatively small datasets (e.g., ImageNet-1K [11]), and computationally complex as a result of the quadratic complexity of self-attention. Consequently, further studies sought to facilitate training. One approach was using network distillation [12]. Another was to insert CNNs at the early stages of the network [13]. Further attempts to improve ViTs re-introduced inductive biases found in CNNs (e.g., using local self-attention [14] and hierarchical/pyramidal network structures [15]). There were also efforts to replace dot-product QKV self-attention with alternatives [e.g., 16]. With these modifications in place, vision transformers can compete with CNNs with respect to computational efficiency and accuracy, even when trained on relatively small datasets [see this blog post by Bert Moons for more discussion on ViTs].

Vision MLPs

Notwithstanding the success of recent vision transformers, several studies demonstrate that models building solely on multilayer perceptrons (MLPs) — so-called vision MLPs (V-MLPs) — can achieve surprisingly good results on typical computer vision tasks like object classification, detection and segmentation. These models aim for global spatial processing, but without the computationally complex self-attention. At the same time, these models are easy to scale (high model capacity) and seek to retain a model structure with low inductive bias, which makes them applicable to a wide range of tasks [17].

Like ViTs, the V-MLPs first decompose the images into non-overlapping patches, called tokens, which form the input into a V-MLP block. A typical V-MLP block consists of a spatial MLP (token mixer) and a channel MLP (channel mixer), interleaved by (layer) normalization and complemented with residual connections. This is illustrated in Figure 1.

Figure 1. Typical V-MLP structure. Adapted from [17].

Here the spatial MLP captures the global correlations between tokens, while the channel MLP combines information across features. This can be formulated as follows:

Y = spatialMLP(LN(X)) + X,
Z = channelMLP(LN(Y)) + Y,

Here X is a matrix containing the tokens, Y consists of intermediate features, LN denotes layer normalization, and Z is the output feature of the block. In these equations, spatialMLP and channelMLP can be any nonlinear function represented by some type of MLP with activation function (e.g. GeLU).
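To make this concrete, here is a minimal PyTorch sketch of the block defined by the two equations above, in the spirit of MLP-Mixer [17]; the class names, hidden-layer expansion factor and token/embedding sizes are illustrative choices, not taken from any particular paper.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Two-layer perceptron with a GELU nonlinearity."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))

    def forward(self, x):
        return self.net(x)

class VMLPBlock(nn.Module):
    """Spatial (token) mixing followed by channel mixing, each with a residual connection."""
    def __init__(self, num_tokens, dim):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.spatial_mlp = MLP(num_tokens, num_tokens * 4)   # mixes information across tokens
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = MLP(dim, dim * 4)                 # mixes information across channels

    def forward(self, x):                                    # x: (batch, num_tokens, dim)
        # Y = spatialMLP(LN(X)) + X  -- transpose so the MLP acts along the token axis
        y = self.spatial_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2) + x
        # Z = channelMLP(LN(Y)) + Y
        return self.channel_mlp(self.norm2(y)) + y

tokens = torch.randn(2, 196, 768)          # e.g. 14x14 patches with 768-dim embeddings
out = VMLPBlock(num_tokens=196, dim=768)(tokens)
print(out.shape)                           # torch.Size([2, 196, 768])
```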

In practice, the channelMLP is often implemented by one or more 1×1 convolutions, and most of the innovation found in different studies lies in the structure of the spatialMLP submodule. And, here’s where history repeats itself. Where ViTs started as isotropic models with global spatial processing (e.g., ViT [10] or DeiT [12]), V-MLPs did so too (e.g., MLP-Mixer [17] or ResMLP [18]). Where recent ViTs improved their accuracy and performance on visual tasks by adhering to a hierarchical structure with local spatial processing (e.g., Swin-transformer [14] or NesT [19]), recent V-MLPs do so too (e.g., Hire-MLP [20] or S^2-MLPv2 [21]). These modifications made the models more computationally efficient (fewer parameters and FLOPs), easier to train and more accurate, especially when trained on relatively small datasets. Hence, over time both ViTs and V-MLPs re-introduced the inductive biases well known from CNNs.

Due to their fully connected nature, V-MLPs are not permutation invariant and thus do not necessitate the type of positional encoding frequently used in ViTs. However, one important drawback of pure V-MLPs is the fixed input resolution required for the spatialMLP submodule. This makes transfer to downstream tasks, such as object detection and segmentation, difficult. To mitigate this problem, some researchers have inserted convolutional layers or, similarly, bicubic interpolation layers, into the V-MLP (e.g., ConvMLP [22] or RaftMLP [23]). Of course, to some degree, this defies the purpose of V-MLPs. Other studies have attempted to solve this problem using MLPs only (e.g., [20, 21, 30]), but the data-shuffling needed to formulate the problem as an MLP results in an operation that is very similar or even equivalent to some form of (grouped) convolution.

See Table 1 for an overview of different V-MLPs. Note how some of the V-MLP models are very competitive with (or better than) state-of-the-art CNNs, e.g. ConvNeXt-B with 89M parameters, 45G FLOPs and 83.5% accuracy [28].

Table 1. Overview of some V-MLPs. For each V-MLP, we present the accuracy of the largest reported model trained on ImageNet-1K only.

What matters?

It is important to note that the high-level structure of V-MLPs is not new. Depthwise-separable convolutions, for example, as used in MobileNets [6], consist of a depthwise convolution (spatial mixer) and a pointwise 1×1 convolution (channel mixer). Furthermore, the standard transformer block comprises a self-attention layer (spatial mixer) and a pointwise MLP (channel mixer). This suggests that the good performance and accuracy obtained with these models result at least partly from the high-level structure of layers used inside V-MLPs and related models. Specifically: (1) the use of non-overlapping spatial patch embeddings as inputs, (2) some combination of independent spatial (with large enough spatial kernels) and channel processing, (3) some interleaved normalization, and (4) residual connections.

Recently, such a block structure has been dubbed “Metaformer” ([24], Figure 2), referring to the high-level structure of the block rather than the particular implementation of its subcomponents. Some evidence for this hypothesis comes from [27], who used a simple isotropic, purely convolutional model, called “ConvMixer,” that takes non-overlapping patch embeddings as inputs. Given an equal parameter budget, their model shows improved accuracy compared to standard ResNets and DeiT. A more thorough analysis of this hypothesis was performed in “A ConvNet for the 2020s” [28], which systematically examined the impact of block elements (1)-(4), arriving at a purely convolutional model that reaches state-of-the-art performance on ImageNet, even when trained on ImageNet-1K alone.

Figure 2. a. V-MLP, b. Transformer and c. MetaFormer. Adapted from [24].
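To illustrate the Metaformer idea, here is a minimal sketch of a block whose token mixer is a plain depthwise convolution rather than self-attention or a spatial MLP, loosely in the spirit of ConvMixer [27]; the class name, kernel size and embedding sizes are illustrative assumptions, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class ConvMixerBlock(nn.Module):
    """Depthwise conv = spatial (token) mixer, pointwise 1x1 conv = channel mixer."""
    def __init__(self, dim, kernel_size=9):
        super().__init__()
        self.spatial = nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same")  # depthwise
        self.channel = nn.Conv2d(dim, dim, kernel_size=1)                             # pointwise
        self.norm1 = nn.BatchNorm2d(dim)
        self.norm2 = nn.BatchNorm2d(dim)
        self.act = nn.GELU()

    def forward(self, x):                                # x: (batch, dim, H, W) patch embeddings
        x = x + self.norm1(self.act(self.spatial(x)))    # residual connection around the token mixer
        return self.norm2(self.act(self.channel(x)))     # channel mixer

# Non-overlapping patch embedding: a strided convolution with stride == kernel size.
patch_embed = nn.Conv2d(3, 256, kernel_size=7, stride=7)
x = patch_embed(torch.randn(1, 3, 224, 224))             # -> (1, 256, 32, 32)
print(ConvMixerBlock(256)(x).shape)                      # torch.Size([1, 256, 32, 32])
```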

Conclusion

Taken together, these studies suggest that what matters for efficient and accurate vision models are the particular layer ingredients found in the Metaformer block (tokenization, independent spatial and channel processing, normalization and residual blocks) and the inductive biases typically found in CNNs (local processing with weight sharing and a hierarchical network structure). Clearly, this conclusion does not imply a special role for MLPs, as the Metaformer structure building on purely convolutional layers works (almost) just as well.

So are there other reasons for the recent focus on V-MLPs? The above-mentioned convolutional Metaformers were all tested on vision tasks, and it is well known that the convolutional structure matches natural image statistics well. Indeed, as mentioned above, the best-performing V-MLPs and ViTs (re-)introduce the inductive biases, such as local hierarchical processing, typically found in CNNs. However, if one is interested in a generic model that performs well in multimodal tasks and has lower computational complexity than standard transformers, an MLP-based network can be a good choice. For example, some initial results show that MLP-based Metaformers also perform well on NLP tasks [18, 29].

An additional benefit of isotropic MLPs is that they scale more easily. This scalability can make it easier to implement them on compute infrastructure that relies on regular compute patterns. Furthermore, it facilitates capturing the high information content of large (multimodal) datasets.

So, based on current findings, we can formulate the following practical guidelines: for settings that are significantly resource- and data-constrained, such as edge computing, there is currently little evidence that V-MLPs, like ViTs, are a superior alternative to CNNs. However, when datasets are large and/or multimodal, and compute is more abundant, pure MLP-based models may be a more efficient and generic choice compared to CNNs and transformer-based models that rely on self-attention.

We are still in the early days of examining the possibilities of MLP-based models. In just 9 months the accuracy of V-MLPs on ImageNet classification increased by a stunning ~8%. It is expected that these models will improve further and that hybrid networks, which properly combine MLPs, CNNs and attention mechanisms, have the potential to significantly outperform existing models (e.g. [30]). We are excited to be part of this future.

 
Stay tuned to learn more about our progress in upcoming blog posts, and be sure to subscribe to our newsletter using the form on our homepage!

 

References

[1] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classification with deep convolutional neural networks.” Advances in neural information processing systems 25 (2012): 1097-1105.

[2] Karen Simonyan, Andrew Zisserman: “Very Deep Convolutional Networks for Large-Scale Image Recognition,” 2014; [http://arxiv.org/abs/1409.1556 arXiv:1409.1556].

[3] Min Lin, Qiang Chen, Shuicheng Yan: “Network In Network,” 2013; [http://arxiv.org/abs/1312.4400 arXiv:1312.4400].

[4] Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, Kurt Keutzer: “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size,” 2016; [http://arxiv.org/abs/1602.07360 arXiv:1602.07360].

[5] He, Kaiming, et al. “Deep residual learning for image recognition.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

[6] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam: “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications,” 2017; [http://arxiv.org/abs/1704.04861 arXiv:1704.04861].

[7] Mingxing Tan, Quoc V. Le: “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks,” 2019, International Conference on Machine Learning, 2019; [http://arxiv.org/abs/1905.11946 arXiv:1905.11946].

[8] Ross Wightman, Hugo Touvron, Hervé Jégou: “ResNet strikes back: An improved training procedure in timm,” 2021; [http://arxiv.org/abs/2110.00476 arXiv:2110.00476].

[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova: “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” 2018; [http://arxiv.org/abs/1810.04805 arXiv:1810.04805].

[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby: “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale,” 2020; [http://arxiv.org/abs/2010.11929 arXiv:2010.11929].

[11] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition (pp. 248–255).

[12] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou: “Training data-efficient image transformers & distillation through attention,” 2020; [http://arxiv.org/abs/2012.12877 arXiv:2012.12877].

[13] Tete Xiao, Mannat Singh, Eric Mintun, Trevor Darrell, Piotr Dollár, Ross Girshick: “Early Convolutions Help Transformers See Better,” 2021; [http://arxiv.org/abs/2106.14881 arXiv:2106.14881].

[14] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo: “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows,” 2021; [http://arxiv.org/abs/2103.14030 arXiv:2103.14030].

[15] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao: “Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions,” 2021; [http://arxiv.org/abs/2102.12122 arXiv:2102.12122].

[16] Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, Joao Carreira: “Perceiver: General Perception with Iterative Attention,” 2021; [http://arxiv.org/abs/2103.03206 arXiv:2103.03206].

[17] Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy: “MLP-Mixer: An all-MLP Architecture for Vision,” 2021; [http://arxiv.org/abs/2105.01601 arXiv:2105.01601].

[18] Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Gautier Izacard, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, Hervé Jégou: “ResMLP: Feedforward networks for image classification with data-efficient training,” 2021; [http://arxiv.org/abs/2105.03404 arXiv:2105.03404].

[19] Zizhao Zhang, Han Zhang, Long Zhao, Ting Chen, Sercan O. Arik, Tomas Pfister: “Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding,” 2021; [http://arxiv.org/abs/2105.12723 arXiv:2105.12723].

[20] Jianyuan Guo, Yehui Tang, Kai Han, Xinghao Chen, Han Wu, Chao Xu, Chang Xu, Yunhe Wang: “Hire-MLP: Vision MLP via Hierarchical Rearrangement,” 2021; [http://arxiv.org/abs/2108.13341 arXiv:2108.13341].

[21] Tan Yu, Xu Li, Yunfeng Cai, Mingming Sun, Ping Li: “S^2-MLPv2: Improved Spatial-Shift MLP Architecture for Vision,” 2021; [http://arxiv.org/abs/2108.01072 arXiv:2108.01072].

[22] Jiachen Li, Ali Hassani, Steven Walton, Humphrey Shi: “ConvMLP: Hierarchical Convolutional MLPs for Vision,” 2021; [http://arxiv.org/abs/2109.04454 arXiv:2109.04454].

[23] Yuki Tatsunami, Masato Taki: “RaftMLP: How Much Can Be Done Without Attention and with Less Spatial Locality?” 2021; [http://arxiv.org/abs/2108.04384 arXiv:2108.04384].

[24] Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, Shuicheng Yan: “MetaFormer is Actually What You Need for Vision,” 2021; [http://arxiv.org/abs/2111.11418 arXiv:2111.11418].

[25] Yehui Tang, Kai Han, Jianyuan Guo, Chang Xu, Yanxi Li, Chao Xu, Yunhe Wang: “An Image Patch is a Wave: Quantum Inspired Vision MLP,” 2021; [http://arxiv.org/abs/2111.12294 arXiv:2111.12294].

[26] Ziyu Wang, Wenhao Jiang, Yiming Zhu, Li Yuan, Yibing Song, Wei Liu: “DynaMixer: A Vision MLP Architecture with Dynamic Mixing,” 2022; [http://arxiv.org/abs/2201.12083 arXiv:2201.12083].

[27] Asher Trockman, J. Zico Kolter: “Patches Are All You Need?” 2022; [http://arxiv.org/abs/2201.09792 arXiv:2201.09792].

[28] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie: “A ConvNet for the 2020s,” 2022; [http://arxiv.org/abs/2201.03545 arXiv:2201.03545].

[29] Hanxiao Liu, Zihang Dai, David R. So, Quoc V. Le: “Pay Attention to MLPs,” 2021; [http://arxiv.org/abs/2105.08050 arXiv:2105.08050].

[30] Huangjie Zheng, Pengcheng He, Weizhu Chen, Mingyuan Zhou: “Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs,” 2022; [http://arxiv.org/abs/2202.06510 arXiv:2202.06510].


An Interview with Torsten Hoefler, Axelera AI’s Scientific Advisor

by Evangelos Eleftheriou – CTO at AXELERA AI

Our CTO had a chat with Torsten Hoefler to scratch the surface and get to know our new scientific advisor better.

Evangelos: Could you please introduce yourself and your field of expertise?

Torsten: My background is in High-Performance Computing on supercomputers. I worked on large-scale supercomputers, networks, and the Message Passing Interface specification. More recently, my main research interests have been in learning systems and their applications, especially in the climate simulation area.

E: What is the current focus of your research interests?

T: I try to understand how to improve the efficiency of deep learning systems (both inference and training), ranging from the smallest portable devices to the largest supercomputers. I especially like the application of such techniques for predicting the weather or future climate scenarios.

E: What do you see as the greatest challenges in data-centric computing in the current hardware and software landscape?

T: We need a fundamental shift of thinking – starting from algorithms, where we teach and reason about operational complexity. We need to seriously start thinking about data movement. From this algorithmic base, the data-centric view needs to percolate into programming systems and architectures. On the architecture side, we need to understand the fundamental limitations to create models to guide algorithm engineering. Then, we need to unify this all into a convenient programming system.

 

E: Could you please explain the general concept of DaCe, as a generic data-centric programming framework?

T: DaCe is our attempt to capture data-centric thinking in a programming system that takes Python (and other) code and represents it as a data-centric graph. Performance engineers can then work conveniently on this representation to improve the mapping to specific devices. This ensures the highest performance.

E: DaCe also has extensions for Machine Learning (DaCeML). Where do those help? Could in-memory computing accelerators in general benefit from such a framework, and how?

T: DaCeML supports the Open Neural Network Exchange (ONNX) format and PyTorch through the ONNX exporter. It offers inference as well as training support at the highest performance using data-centric optimizations. In-memory computing accelerators can be a target for DaCe – depending on their offered semantics, a performance engineer could identify pieces of the dataflow graph to be mapped to such accelerators.
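For readers unfamiliar with the hand-off Torsten describes, below is a minimal sketch of exporting a PyTorch model to ONNX, the interchange format that frameworks such as DaCeML can consume. The model choice and file name are illustrative assumptions, and the DaCeML-specific import step is not shown here.

```python
import torch
import torchvision

# Any PyTorch model works; ResNet-18 is used here purely as an example (no pretrained weights needed).
model = torchvision.models.resnet18(weights=None).eval()
dummy_input = torch.randn(1, 3, 224, 224)   # example input that fixes the traced graph shapes

# Export the traced computation graph in ONNX format; downstream tools can then load the .onnx file.
torch.onnx.export(model, dummy_input, "resnet18.onnx", opset_version=13)
```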

E: In which new application domains do you see data-centric computing playing a major role in the future?

T: I would assume all computations where performance or energy consumption is important – ranging from scientific simulations to machine learning and from small handheld devices to large-scale supercomputers.

E: What is your advice to young researchers in the field of data-centric optimization?

T: Learn about I/O complexity!

As Scientific Advisor, Torsten Hoefler advises the Axelera AI Team on the scientific aspects of its research and development. To learn more about Torsten’s work, please visit his biography page.

Stay tuned to learn more about our progress in upcoming blog posts, and be sure to subscribe to our newsletter using the form on our homepage!


Transformers in Computer Vision


by Bert Moons – System Architect at AXELERA AI


Summary: Convolutional Neural Networks (CNNs) have been dominant in Computer Vision applications for over a decade. Today, they are being outperformed and replaced by Vision Transformers (ViTs) with a higher learning capacity. The fastest ViTs are essentially a CNN/Transformer hybrid, combining the best of both worlds: (A) CNN-inspired hierarchical and pyramidal feature maps, where embedding dimensions increase and spatial dimensions decrease throughout the network, are combined with local receptive fields to reduce model complexity, while (B) Transformer-inspired self-attention increases modeling capacity and leads to higher accuracies. Even though ViTs outperform CNNs in specific cases, their dominance has not yet been established. We illustrate and conclude that SotA CNNs are still on par with, or better than, ViTs on ImageNet validation, especially when (1) trained from scratch without distillation, (2) in the lower-accuracy <80% regime, and (3) for lower network complexities optimized for Edge devices.



Convolutional Neural Networks

Convolutional Neural Networks (CNNs) have been the dominant Neural Network architectures in Computer Vision for almost a decade, after the breakthrough performance of AlexNet[1] on the ImageNet[2] image classification challenge. From this baseline architecture, CNNs have evolved into variations of bottlenecked architectures with residual connections, such as ResNet[3] or RegNet[4], or into more lightweight networks optimized for mobile contexts using grouped convolutions and inverted bottlenecks, such as MobileNet[5] or EfficientNet[6]. Typically, such networks are benchmarked and compared by training them on small images from the ImageNet data set. After this pretraining, they can be used for applications beyond image classification, such as object detection, panoptic segmentation, semantic segmentation, or other specialized tasks. This can be done by using them as a backbone in an end-to-end application-specific Neural Network and finetuning the resulting network on the appropriate data set and application.

A typical ResNet-style CNN is shown in Figure 1-1 and Figure 1-4 (a). Such networks typically share several features:

  1. They interleave or stack 1×1 and k×k convolutions to balance the cost of convolutions with building a large receptive field.
  2. Training is stabilized by using batch normalization and residual connections.
  3. Feature maps are built hierarchically by gradually reducing the spatial dimensions (W, H), finally downscaling them by a factor of 32×.
  4. Feature maps are built pyramidally, by increasing the embedding dimensions of the layers from tens of channels in the first layers to thousands in the last.

Figure 1-1: Illustration of ResNet34 [3]
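For readers who want to see these features in code, below is a minimal PyTorch sketch of a ResNet-style bottleneck block: it covers features 1 and 2 directly, while the stride and channel widening hint at the hierarchical and pyramidal structure of features 3 and 4. The channel counts and stride are illustrative and do not reproduce any specific ResNet configuration.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 -> 3x3 -> 1x1 convolutions with batch normalization and a residual connection."""
    def __init__(self, in_ch, mid_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False), nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )
        # Projection shortcut when the spatial size or channel count changes.
        self.shortcut = (nn.Identity() if stride == 1 and in_ch == out_ch else
                         nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                                       nn.BatchNorm2d(out_ch)))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))   # residual connection

x = torch.randn(1, 64, 56, 56)
print(Bottleneck(64, 64, 256, stride=2)(x).shape)   # spatial dims halved, channels widened
```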

Within these broader families of backbone networks, researchers have developed a set of techniques known as Neural Architecture Search (NAS)[7] to optimize the exact parametrizations of these networks. Hardware-Aware NAS methods automatically optimize a network’s latency while maximizing accuracy, by efficiently searching over its architectural parameters such as the number of layers, the number of channels within each layer, kernel sizes, activation functions and so on. So far, due to high training costs, these methods have failed to invent radically new architectures for Computer Vision. They mostly generate networks within the ResNet/MobileNet hybrid families, leading to only modest improvements of 10-20% over their hand-designed baseline[8].
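As a purely illustrative toy (not how any production NAS system works), the sketch below shows the flavour of hardware-aware search: sample architectural parameters, reject candidates that violate a latency budget, and keep the best-scoring one. The latency proxy and accuracy estimate are made-up stand-ins for a real cost model and a real training/evaluation loop.

```python
import random

def latency_proxy(depth, width):
    # Illustrative cost model: latency grows with layer count and quadratically with width.
    return depth * (width ** 2) * 1e-6              # "milliseconds", purely hypothetical

def accuracy_estimate(depth, width):
    # Stand-in for training and evaluating the candidate; in practice this is the costly step.
    return 70 + 2.0 * (depth ** 0.5) + 0.01 * width + random.uniform(-0.5, 0.5)

best, budget_ms = None, 1.0
for _ in range(200):                                 # random search over the design space
    depth = random.choice([10, 18, 34, 50])
    width = random.choice([32, 64, 128, 256])
    if latency_proxy(depth, width) > budget_ms:      # reject candidates over the latency budget
        continue
    score = accuracy_estimate(depth, width)
    if best is None or score > best[0]:
        best = (score, depth, width)

print("best (score, depth, width):", best)
```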

Transformers in Computer Vision

A more radical evolution in Neural Networks for Computer Vision, is the move towards using Vision Transformers (ViT)[9] as a CNN-backbone replacement. Inspired by the astounding performance of Transformer models in Natural Language Processing (NLP)[10], research has moved towards applying the same principles in Computer Vision. Notable examples, among many others, are XCiT[11], PiT[12], DeiT[13] and SWIN-Transformers[14]. Here, analogously to NLP processing, images are essentially treated as sequences of image patches, by modeling feature maps as vectors of tokens, each token representing an embedding of a specific image patch.


Figure 1-2: Illustration of the original basic vision transformer (ViT), taken from [9]

 

 

Figure 1-3: Illustration of a self-attention module. K, Q and V are linear projections of the same input feature map. The attention map is a softmax function of the matrix product QKT. Image taken from source[15].

An illustration of a basic ViT is given in Figure 1-2. The ViT is a sequence of stacked MLPs and self-attention layers, with or without residual connections. This ViT uses the multi-headed self-attention mechanism developed for NLP Transformers, see Figure 1-3. Such a self-attention layer has two distinguishing features. It can (1) dynamically ‘guide’ its attention by reweighting the importance of specific features depending on the context, and (2) it has a full receptive field in case global self-attention is used. The latter is the case when self-attention is applied across all possible input tokens. Here all tokens, representing embeddings related to specific spatial image patches, are correlated with each other, giving a full receptive field. Global self-attention is typical in ViTs, but not a requirement. Self-attention can also be made local by limiting the scope of the self-attention module to a smaller set of tokens, in turn reducing the operation’s receptive field at a particular stage.
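For concreteness, here is a minimal single-head self-attention sketch matching Figure 1-3; the 1/sqrt(d) scaling and the token/embedding sizes are standard choices added for illustration, and real ViTs use multi-headed attention.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head self-attention: K, Q, V are linear projections of the same tokens."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):                                   # x: (batch, tokens, dim)
        q, k, v = self.q(x), self.k(x), self.v(x)
        # (B, N, N) attention map: this N x N matrix is the source of the quadratic cost.
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v                                     # context-dependent re-weighting of tokens

tokens = torch.randn(1, 197, 384)      # e.g. 196 patch tokens + 1 class token, 384-dim embeddings
print(SelfAttention(384)(tokens).shape)                     # torch.Size([1, 197, 384])
```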

This ViT architecture contrasts strongly with CNNs. In vanilla CNNs without attention mechanisms, (1) features are statically weighted using pretrained weights, rather than dynamically reweighted based on the context as in ViTs, and (2) the receptive fields of individual network layers are typically local and limited by the convolutional kernel size.

Part of the success of CNNs is their strong architectural inductive bias implied in the convolutional approach. Convolutions with shared weights explicitly encode how specific identical patterns are repeated in images. This inductive bias ensures easy training convergence on relatively small datasets, but also limits the modeling capacity of CNNs. Vision Transformers do not enforce such strict inductive biases. This makes them harder to train, but also increases their learning capacity, see Figure 1-5. To achieve good results using ViTs in Computer Vision, these networks are often trained using knowledge distillation with a large CNN-based teacher (as in DeiT[16] for example). This way, part of the inductive bias of CNNs can be more softly forced into the training process.
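As an illustration of the distillation idea, here is a minimal sketch of a combined soft/hard-target loss in the style of classical knowledge distillation. DeiT’s actual recipe (a dedicated distillation token and hard-label distillation) differs; the temperature and weighting below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=3.0, alpha=0.5):
    # Soft targets: match the teacher's softened class distribution (KL divergence).
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    # Hard targets: the usual cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(8, 1000, requires_grad=True)   # e.g. ViT student outputs
teacher_logits = torch.randn(8, 1000)                        # e.g. frozen CNN teacher outputs
labels = torch.randint(0, 1000, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```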

Initially, ViTs were directly inspired by NLP Transformers: massive models with a uniform topology and global self-attention, see Figure 1-4 (b). Recent ViTs have a macro-architecture that is closer to that of CNNs (Figure 1-4 (a)), using hierarchical pyramidal feature maps (as in PiT [12]; see Figure 1-4 (c)) and local self-attention (as in Swin Transformers [14]). A high-level overview of this evolution is given in Table 1.

 


Figure 1-4: Comparing the dimension configurations of (a) ResNet-50, a classical CNN with pyramidal feature maps, (b) an early ViT-S/16 [9] with a uniform macro-architecture and (c) a modern PiT-S [12] with CNN-ified pyramidal feature maps. Figure taken from [12].

 

Table 1: Comparing early ViTs, recent ViTs and modern CNNs


 

Comparing CNNs and ViTs for Edge Computing

Even though ViTs have shown State-of-the-Art (SotA) performance in many Computer Vision tasks, they do not necessarily outperform CNNs across the board. This is illustrated in Figure 1-5 and Figure 1-6. These figures compare the performance of ViTs and CNNs in terms of ImageNet validation accuracy versus model size and complexity, for various training regimes. It’s important to distinguish between these training regimes, as not all training methodologies are feasible for specific downstream tasks. First, for some applications there are only relatively small datasets available. In that case, CNNs typically perform better. Second, many ViTs rely on distillation approaches to achieve high performance. For that to work, they need a highly-accurate pretrained CNN as a teacher, which is not always available.

Figure 1-5 (a) illustrates how CNNs and ViTs compare in terms of model size versus accuracy if all types of training are allowed, including distillation approaches and using additional data (such as JFT-300[17]). Here ViTs perform on par with or better than large-scale CNNs, outperforming them in specific ranges. Notably, XCiT [11] models perform particularly well in the ~3M-parameter range. However, when neither distillation nor training on extra data is allowed, the difference is less pronounced, see Figure 1-5 (b). In both figures, EfficientNet-B0 and ResNet-50 are indicated as references for context.

 


Figure 1-5: Comparing CNNs to ViTs in terms of model size (# Params) and ImageNet Top-1 Validation accuracy. (a) Shows data for all types of training: (i) training on ImageNet1k training data, (ii) using extra data such as ImageNet21k or JFT [17] and (iii) training using knowledge distillation with a CNN teacher. (b) Shows data for a subset of networks that are trained from scratch, without CNN-based knowledge distillation, but with state-of-the-art training techniques on ImageNet. Figure (b) illustrates the lasting competitiveness of CNNs versus ViTs, especially in the Edge domain for models with fewer than 25M parameters, where performance is very similar between CNNs and ViTs. ResNet-50 and EfficientNet-B0 are given as reference points. Data is taken from [18] and the respective scientific papers.

 

Figure 1-6 illustrates the same in terms of accuracy versus model complexity for a more limited set of known networks. Figures 1-6 (a) and (b) show that CNNs are mostly dominant for lower accuracies and networks with lower complexity (<1B FLOPS) for all types of training. This holds even for CNN-ified Vision Transformers such as PiT [12], which use a hierarchical architecture with pyramidal feature maps, and for SWIN transformers, which optimize complexity by using local self-attention. Without extra data or distillation, CNNs typically outperform ViTs across the board, especially for networks with a lower complexity or for networks with accuracies lower than 80%. For example, at a similar complexity, both RegNets and EfficientNet-style networks significantly outperform XCiT ViTs, see Figure 1-6 (b).

 


Figure 1-6: Comparing SotA CNNs to ViTs in terms of computational cost (# FLOPS) and ImageNet Top-1 Validation Accuracy. (a) Shows data for all types of training: (i) training on ImageNet1k training data, (ii) using extra data such as ImageNet21k or JFT [17] and (iii) training using knowledge distillation with a CNN teacher. (b) Shows data for a subset of networks that are trained from scratch, without extra data or knowledge distillation, but with state-of-the-art training techniques on ImageNet. Figure (b) illustrates how CNNs are still dominant in the <80% accuracy regime. Even CNN-ified modern ViTs with hierarchical pyramidal models such as PiT [12] do not outperform EfficientNet [6] and RegNet [4] style CNNs. In the 80%+ range, networks with local self-attention such as SWIN [14] are on par with or better than RegNets [4]. Data is taken from [16] and the respective scientific papers.

 

Apart from the high-level differences in Table 1 and the performance differences in this section, there are some other key differences in the requirements for bringing ViTs to edge devices. Compared to CNNs, ViTs rely much more on three specific operations that must be properly accelerated on-chip. First, ViTs rely on accelerated softmax operators as part of self-attention, while CNNs only require a softmax as the final layer in a classification network. On top of that, ViTs typically use smooth non-linear activation functions, while CNNs mostly rely on Rectified Linear Units (ReLU), which are much cheaper to execute and accelerate. Finally, ViTs typically require LayerNorm, a form of layer normalization with dynamic computation of the mean and standard deviation to stabilize training. CNNs, however, typically use batch normalization, which must only be computed during training and can essentially be ignored in inference by folding the operation into neighbouring convolutional layers.
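To make the last point concrete, here is a minimal sketch of folding a trained BatchNorm layer into the preceding convolution for inference; the layer shapes and names are illustrative.

```python
import torch
import torch.nn as nn

def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Return a new Conv2d whose weights and bias absorb the BatchNorm statistics."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)        # per-output-channel factor
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    conv_bias = conv.bias.data if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.data = (conv_bias - bn.running_mean) * scale + bn.bias
    return fused

conv, bn = nn.Conv2d(8, 16, 3, padding=1), nn.BatchNorm2d(16)
conv.eval(); bn.eval()                       # use running statistics, not batch statistics
x = torch.randn(1, 8, 32, 32)
fused = fold_bn_into_conv(conv, bn)
print(torch.allclose(bn(conv(x)), fused(x), atol=1e-5))   # True: the fused conv matches conv + BN
```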

 

Conclusion

Vision Transformers are rapidly starting to dominate many applications in Computer Vision. Compared to CNNs, they achieve higher accuracies on large data sets due to their higher modeling capacity, lower inductive biases and global receptive fields. Modern, improved and smaller ViTs such as PiT and SWIN are essentially becoming CNN-ified, reducing receptive fields and using hierarchical pyramidal feature maps. However, CNNs are still on par with or better than SotA ViTs on ImageNet in terms of model complexity or size versus accuracy, especially when trained without knowledge distillation or extra data and when targeting lower accuracies.

Stay tuned to learn more about our progress in upcoming blog posts, and be sure to subscribe to our newsletter using the form on our homepage!


References

[1] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classification with deep convolutional neural networks.” Advances in neural information processing systems 25 (2012): 1097-1105.

[2] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition (pp. 248–255).

[3] He, Kaiming, et al. “Deep residual learning for image recognition.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

[4] Radosavovic, Ilija, et al. “Designing network design spaces.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.

[5] Howard, Andrew, et al. “Searching for mobilenetv3.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.

[6] Tan, Mingxing, and Quoc Le. “Efficientnet: Rethinking model scaling for convolutional neural networks.” International Conference on Machine Learning. PMLR, 2019.

[7] He, Xin, Kaiyong Zhao, and Xiaowen Chu. “AutoML: A Survey of the State-of-the-Art.” Knowledge-Based Systems 212 (2021): 106622.

[8] Moons, Bert, et al. “Distilling optimal neural networks: Rapid search in diverse spaces.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021

[9] Dosovitskiy, Alexey, et al. “An image is worth 16×16 words: Transformers for image recognition at scale.” arXiv preprint arXiv:2010.11929 (2020).

[10] Devlin, Jacob, et al. “Bert: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018).

[11] El-Nouby, Alaaeldin, et al. “XCiT: Cross-Covariance Image Transformers.” arXiv preprint arXiv:2106.09681 (2021).

[12] Heo, Byeongho, et al. “Rethinking spatial dimensions of vision transformers.” arXiv preprint arXiv:2103.16302 (2021).

[13] Touvron, Hugo, et al. “Training data-efficient image transformers & distillation through attention.” International Conference on Machine Learning. PMLR, 2021.

[14] Liu, Ze, et al. “Swin transformer: Hierarchical vision transformer using shifted windows.” arXiv preprint arXiv:2103.14030 (2021).

[15] Li, Yawei, et al. “Spatio-Temporal Gated Transformers for Efficient Video Processing.”, NeurIPS ML4AD Workshop, 2021

[16] Touvron, Hugo, et al. “Training data-efficient image transformers & distillation through attention.” International Conference on Machine Learning. PMLR, 2021.

[17] Sun, Chen, et al. “Revisiting unreasonable effectiveness of data in deep learning era.” Proceedings of the IEEE international conference on computer vision. 2017.

[18] Ross Wightman, “Pytorch Image Models”, https://github.com/rwightman/pytorch-image-models, seen on January 10, 2022


An Interview with Marian Verhelst, Axelera AI’s Scientific Advisor

by Fabrizio Del Maffeo – CEO of AXELERA AI

I met Marian Verhelst in the summer of 2019, and she immediately struck me with her passion for and competence in computing architecture design. We began collaborating right away, and today she’s here with us sharing her insights on the future of computing.

 

F: There are different approaches and trends in new computing designs for artificial intelligence workloads: increasing the number of computing cores from a few to tens, thousands or even hundreds of thousands of small, efficient cores, as well as near-memory processing, computing-in-memory, or in-memory computing. What is your opinion about these architectures? What do you think is the most promising approach? Are there any other promising architecture developments?

M: Having seen the substantial divergence in ML algorithmic workloads and the general trends in the processor architecture field, I am a firm believer in very heterogeneous multi-core solutions. This means that future processing systems will have a large number of cores with very different natures. Eventually, such cores will include (digital) in- or near-memory processing cores, coarse grain reconfigurable systolic arrays and more traditional flexible SIMD cores. Of course, the challenge is to build compilers and mappers that can grasp all opportunities from such heterogeneous and widely parallel fabrics. To ensure excellent efficiency and memory capabilities, it will be especially important to exploit the cores in a streaming fashion, where one core immediately consumes the data produced by another.

 

F: Computing design researchers are working on low power and ultra-low power consumption design using metrics such as TOPs/w as a key performance indicator and low precision networks trained mainly on small datasets. However, we also see neural network research increasingly focusing on large networks, particularly transformer networks that are gaining traction in field deployment and seem to deliver very promising results. How can we conciliate these trends? How far are we from running these networks at the edge? What kind of architecture do you think can make this happen?

M: There will always be people working to improve energy efficiency for the edge and people pushing for throughput across the stack. The latter typically starts in the data centre but gradually trickles down to the edge, where improved technology and architectures enable better performance. It is never a story of choosing one option over another.
Over the past years, developers have introduced increasingly distributed solutions, dividing the workload between the edge and the data centre. The vital aspect of these solutions is that they need to work with scalable processor architectures. Developers can deploy these architectures with a smaller core count at the extreme edge and scale up to larger core numbers for the edge and a massive core count for the data centre. This will require processing architectures and memory systems that rely on a mesh-type distributed processor fabric, rather than being centrally controlled by a single host.

 

F: How do you see the future of computing architecture for the data centre? Will it be dominated by standard computing, GPU, heterogeneous computing, or something else?

M: As I noted earlier, I believe we will see an increasing amount of heterogeneity in the field. The data centre will host a wide variety of processors and differently-natured accelerator arrays to cover the widely different workloads in the most efficient manner possible. As a hardware architect, the exciting and still open challenge is what library of (configurable) processing tiles can cover all workloads of interest. Most intriguing is that, due to the slow nature of hardware development, this processor library should cover not only the algorithms we know of today but also those that researchers will develop in the years to come.

 

As Scientific Advisor, Marian Verhelst advises the Axelera AI Team on the scientific aspects of its research and development. To learn more about Marian’s work, please visit her biography page.

Stay tuned to learn more about our progress in upcoming blog posts, and be sure to subscribe to our newsletter using the form on our homepage!


Blog

What's Next for Data Processing? A Closer Look at In-Memory Computing

by Evangelos Eleftheriou, CTO of AXELERA AI

Technology is progressing at an incredible pace, and no field is moving faster than Artificial Intelligence (AI). Indeed, we are on the cusp of an AI revolution that is already reshaping our lives. AI technologies can be used to automate tasks or to augment human capabilities, with applications including autonomous driving, advances in sensory perception and the acceleration of scientific discovery through machine learning. In the past five years, AI has become synonymous with Deep Learning (DL), another area seeing fast and dramatic progress. We are at a point where Deep Neural Networks (DNNs) for image and speech recognition can provide accuracy on par with, or even better than, that achieved by humans.

Most of the fundamental algorithmic developments around DL go back decades. However, the recent success has stemmed from the availability of large amounts of data and immense computing power for training neural networks. From around 2010, the exponential increase in single-precision floating-point operations offered by Graphics Processing Units (GPUs) ran in parallel with the explosion of neural network sizes and computational requirements. Specifically, the amount of compute used in the largest AI training runs has doubled roughly every 3.5 months during the last decade. At the same time, the size of state-of-the-art models increased from 26M weights for ResNet-50 to 1.5B for GPT-2. This phenomenal increase in model size is reflected directly in the cost of training such complex models. For example, the cost of training the bidirectional transformer network BERT, for Natural Language Processing applications, is estimated at $61,000, whereas training XLNet, which outperformed BERT, costs about nine times as much. However, a major concern is not only the cost associated with the substantial energy consumption needed to train complex networks but also the significant environmental impact incurred in the form of CO2 emissions.
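
To put that doubling period in perspective, the compounding works out to roughly an order of magnitude per year; a quick sanity check:

```python
# Growth factor implied by a doubling of training compute every 3.5 months.
DOUBLING_PERIOD_MONTHS = 3.5

def growth_factor(months: float) -> float:
    return 2 ** (months / DOUBLING_PERIOD_MONTHS)

print(f"per year:   ~{growth_factor(12):.0f}x")    # ~11x
print(f"per decade: ~{growth_factor(120):.1e}x")   # ~2.1e10x
```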

As the world looks to reduce carbon emissions, there is an even greater need for higher performance with lower power consumption. This is true not only for AI applications in the data center, but also at the Edge, which is where we expect the next revolution to take place. AI at the Edge refers to processing data where it is collected, as opposed to requiring it to be moved to separate processing centers. There is a wealth of applications at the edge: AI for mobile devices, including authentication, speech recognition and mixed/augmented reality; AI for embedded processing in IoT devices, from smart cities and homes to prosthetics, wearables and personalized healthcare; and AI for real-time video analytics for autonomous navigation and control. However, these embedded applications are all energy and memory constrained, meaning energy efficiency matters even more at the Edge. The end of Moore’s law and Dennard scaling compounds these challenges. Thus, there are compelling motivations to explore novel computing architectures, drawing inspiration from the most efficient computer on the planet: the human brain.


Traditional Computing Systems: Current State of Play

Traditional digital computing systems, based on the von Neumann architecture, consist of separate processing and memory units. Therefore, performing computations typically results in a significant amount of data being moved back and forth between the physically separated memory and processing units. This data movement costs latency and energy and creates an inherent performance bottleneck. The latency associated with the growing disparity between the speed of memory and processing units, commonly known as the memory wall, is one example of a crucial performance bottleneck for a variety of AI workloads. Similarly, the energy cost associated with shuttling data represents another key challenge for computing systems that are severely power-limited due to cooling constraints, as well as for the plethora of battery-operated mobile devices. In general, the energy cost of multiplying two numbers is orders of magnitude lower than that of accessing numbers from memory. Therefore, it is clear to AI developers that there is a need to explore novel computing architectures that provide better collocation of the processing and memory subsystems. One suggested concept in this area is near-memory computing, which aims to reduce the physical distance and time needed to access memory. This approach heavily leverages recent advances in die stacking and new technologies such as the Hybrid Memory Cube (HMC) and High Bandwidth Memory (HBM).
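
The imbalance between compute energy and data-movement energy can be made concrete with commonly cited, order-of-magnitude figures (roughly in line with published 45 nm estimates; the exact values vary by technology, but the ratio is the point):

```python
# Illustrative, order-of-magnitude energies per 32-bit operation (picojoules),
# loosely based on commonly cited 45 nm estimates; exact values vary by
# technology node, but the ratio is what matters.
E_FP32_MULT_PJ = 3.7     # one floating-point multiply
E_SRAM_READ_PJ = 5.0     # one word read from a small on-chip SRAM
E_DRAM_READ_PJ = 640.0   # one word read from off-chip DRAM

# One multiply whose two operands both come from off-chip DRAM:
compute_pj = E_FP32_MULT_PJ
movement_pj = 2 * E_DRAM_READ_PJ
print(f"data movement costs ~{movement_pj / compute_pj:.0f}x the compute")  # ~346x
```

Even generous caching only softens this: the energy bill is dominated by moving operands, which is exactly the cost that near- and in-memory computing aim to eliminate.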


In-Memory Computing: A Radical New Approach

In-memory computing is a radically different approach to data processing, in which certain computational tasks are performed in place in the memory itself (Sebastian 2020). This is achieved by organizing the memory as a crossbar array and by exploiting the physical attributes of the memory devices. The peripheral circuitry and the control logic play a key role in creating what we call an in-memory computing (IMC) unit or computational memory unit (CMU). In addition to overcoming the latency and energy issues associated with data movement, in-memory computing has the potential to significantly improve the computational time complexity associated with certain computational tasks. This is primarily a result of the massive parallelism created by a dense array of millions of memory devices simultaneously performing computations.

For instance, crossbar arrays of such memory devices can be used to store a matrix and perform matrix-vector multiplications (MVMs) at constant O(1) time complexity without intermediate movement of data. Efficient matrix-vector multiplication via in-memory computing is very attractive for training and inference of deep neural networks, particularly for inference applications at the Edge, where high energy efficiency is critical. In fact, matrix-vector multiplications constitute 70–90% of all deep learning operations. Thus, applications requiring numerous AI components, such as computer vision, natural language processing, reasoning and autonomous driving, can exploit this new technology in innovative ways. Novel dedicated hardware with massive on-chip memory, part of which is enhanced with in-memory computation capabilities, could lead to very efficient training and inference engines for ultra-large neural networks comprising potentially billions of synaptic weights.
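
The principle behind the O(1) claim can be sketched in a few lines of Python: if the matrix is stored as device conductances and the input vector is applied as voltages, Ohm’s and Kirchhoff’s laws deliver all the multiply-accumulate results in a single read step. This is an idealised toy model (no noise, drift or quantisation), not a device simulation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Idealised crossbar: the weight matrix is stored as device conductances G,
# and the input vector is applied as voltages V on the input lines.
G = rng.uniform(0.0, 1.0, size=(4, 6))   # conductances, arbitrary units
V = rng.uniform(0.0, 1.0, size=6)        # input voltages

# Ohm's law gives I_ij = G_ij * V_j in every cell, and Kirchhoff's current
# law sums the currents along each output line, so the whole MVM appears
# as the vector of line currents after a single read step.
I = G @ V
assert np.allclose(I, [sum(G[i, j] * V[j] for j in range(6)) for i in range(4)])
print(I)
```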

The core technology of IMC is memory. In general, there are two classes of memory devices. The conventional class, in which information is stored as the presence or absence of charge, includes dynamic random-access memory (DRAM), static random-access memory (SRAM) and Flash memory. There is also an emerging class of memory devices, in which information is stored in the atomic arrangements within nanoscale volumes of material rather than as charge on a capacitor. Generally speaking, one atomic configuration corresponds to one logic state and a different configuration to the other. These differences in atomic configuration manifest as a change in resistance, and these devices are therefore collectively called resistive memory devices or memristors. Both traditional and emerging memory technologies can perform a range of in-memory logic and arithmetic operations. In addition, SRAM, Flash and all memristive memories can be used for MVM operations.

Table 1 – Comparing different memory technologies. Sources: (B. Li 2019), (Marinella 2013)

Which Memory Technology for Which Operation? Considerations to Keep in Mind

There are many trade-offs involved in selecting which memory technology is suitable for MVM operations for the target DL workloads. For example, read latency largely determines the throughput of the system, measured in operations per second (OPS). This means it also indirectly affects the system’s efficiency, measured in OPS/W. On the other hand, memory volatility and write time determine whether the system supports static or reloadable weights. Cycling endurance is another important characteristic to keep in mind, as it determines the suitability of a memory technology for training and/or inference applications. For example, the limited endurance of PCM, ReRAM and Flash memory devices precludes them from DL training applications. The cell size also has an impact on compute density. Specifically, it affects the die area and therefore the ASIC cost.
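
To see how these parameters interact, here is a back-of-the-envelope calculation; every number below is a made-up, illustrative assumption rather than a measurement of any particular device:

```python
# Back-of-the-envelope throughput/efficiency for a hypothetical IMC array.
rows, cols = 512, 512        # cells in the crossbar
t_read_s = 10e-9             # one analogue MVM "read" every 10 ns
power_w = 0.05               # average power of array plus periphery (W)

ops_per_read = 2 * rows * cols          # one multiply and one add per cell
ops_per_second = ops_per_read / t_read_s
tops = ops_per_second / 1e12
print(f"throughput : {tops:.1f} TOPS")              # ~52.4 TOPS
print(f"efficiency : {tops / power_w:.0f} TOPS/W")  # ~1049 TOPS/W
```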

It is also important to look at temperature stability, drift phenomena and noise effects. In general, all memory devices exhibit intra-device variability and randomness that is intrinsic to how they operate. However, resistive memory devices appear to be more prone to noise (read and write), nonlinear behaviour, inter-device variability and inhomogeneity across an array. Thus, the precision achieved when using memristive technologies for analogue matrix-vector operations is typically not very high and requires the use of additional hardware-aware training techniques to achieve FP32-equivalent accuracies. Finally, the compatibility of the manufacturing process for memory devices with the CMOS technology and their scalability to lower lithography nodes are very important considerations for the successful commercialization of IMC technology and its future roadmap.

SRAM has a unique advantage in that it exhibits the fastest read and write times and the highest endurance of all memory devices. Thus, SRAM enables high-performance, reprogrammable IMC engines for both inference and training applications. Moreover, SRAM follows the scaling of CMOS technology to low lithography nodes and requires standard materials and processes that are readily available to foundries. On the other hand, it is a volatile memory technology that consumes energy even when idle, simply to retain data. In addition, SRAM’s cell size, approximately 100 F², is the largest of all charge- and resistance-based memory technologies. However, volatility is not a serious drawback, as applications very rarely dictate static models. In fact, the fast write time of SRAM is a crucial advantage, allowing DL models to be swapped through very fast re-programmability. Finally, from a system architecture standpoint, the fast re-programmability of SRAM removes the need to map the entire DNN onto multiple crossbar arrays of memory devices, which would result in a large and costly ASIC.
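
To put the ~100 F² cell size in perspective, here is a rough area estimate; the feature size and array capacity are illustrative assumptions, and peripheral circuitry is ignored:

```python
# Rough SRAM array area implied by a ~100 F^2 bit cell.
F_um = 0.016                       # assumed 16 nm feature size, in micrometres
cell_area_um2 = 100 * F_um ** 2    # ~0.0256 um^2 per bit cell
bits = 8 * 1024 * 1024             # an 8 Mbit array, as an example

array_area_mm2 = bits * cell_area_um2 / 1e6
print(f"cell area : {cell_area_um2:.4f} um^2")
print(f"array area: {array_area_mm2:.2f} mm^2 (cells only)")  # ~0.21 mm^2
```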

Recently, imec reported an SRAM-based IMC multiply-accumulate (MAC) unit with a record energy efficiency of 2,900 TOPS/W using ternary weights (imec 2020). There are also experimental prototype SRAM demonstrators that support INT8 activations and weights, whose latency, power consumption and area scale linearly with precision. Clearly, the in-memory MAC implementation and operation are only one part of a multi-faceted IMC-based system. Other digital units are needed to support element-wise vector processing operations, including activation functions, depth-wise convolution, affine scaling, batch normalization and more. Moreover, the performance and usability of a multicore IMC engine also depend on multiple characteristics: an optimized memory hierarchy, a well-balanced fabric, a fine-tuned quantization flow, optimized weight-mapping strategies and a versatile compiler and software tool chain.

There have been many advancements in the computing sector, with even more to come. Our customers, and the industry as a whole, have made it clear that they want a system that offers high throughput, high efficiency and high accuracy – the three highs – while also being easy to use and, of course, cost-effective. At Axelera AI, we are working to design a system that offers all these capabilities and much more. Our AI solution will be based on a novel multicore in-memory computing paradigm combined with an innovative custom dataflow architecture.

Stay tuned to learn more about our progress in upcoming blog posts, and be sure to subscribe to our newsletter using the form on our homepage!

 

References

B. Li, B. Yan, H. Li. 2019. “An Overview of In-memory Processing with Emerging Non-volatile Memory for Data-intensive Applications.” Great Lakes Symposium on VLSI.

imec. 2020. Imec and GLOBALFOUNDRIES Announce Breakthrough in AI Chip, Bringing Deep Neural Network Calculations to IoT Edge Devices. July 2020. Accessed November 2021. https://www.imec-int.com/en/articles/imec-and-globalfoundries-announce-breakthrough-in-ai-chip-bringing-deep-neural-network-calculations-to-iot-edge-devices.

Marinella, M. 2013. “ERD Memory Planning – updated from last week’s telecon.”

Sebastian, A., Le Gallo, M., Khaddam-Aljameh, R., Eleftheriou, E. 2020. “Memory devices and applications for in-memory computing.” Nature Nanotechnology.


Blog

Artificial Intelligence At The Edge: Data-Driven Decision-Making Is Here To Change The World

More than 125 billion “things” are expected to be connected to the internet of things (IoT) by 2030. From the nearly 4 billion smartphones in the world to the tiniest camera sensors in local traffic lights, each of these devices will generate ever-growing amounts of data for analysis.

Data is the new oil and is the most valuable asset for tech giants like Facebook, Google and Amazon. The amount of data-heavy video and images shared on the internet is rapidly increasing, estimated to make up more than 80% of internet traffic by the end of 2021. According to Cisco, 50% of the data produced to date was generated in the last two years. However, only 2% of this staggering amount of data has been analyzed due to a lack of available and accessible tools and hardware, leaving companies to wonder what they can do to address this data gap.

Artificial intelligence, or AI, offers a compelling solution to this problem. Still, it requires increasingly complex and powerful algorithms to analyze these massive amounts of data efficiently. Powerful AI is not enough on its own – due to growing privacy, security and bandwidth concerns, stakeholders increasingly need to process data close to its origin, often on the sensors/devices themselves, in what is called the “edge” of IoT.

The AI technology available today has been designed primarily for cloud computing operations, a sector with considerably fewer constraints in terms of cost, power and scalability. For years, incumbent computing companies have delivered inefficient and expensive computing technologies, opening the door for startups to propose new ones. These innovative solutions aim to match the specific power needs, computational requirements and economics of this new data-driven computing era.

The market opportunity is significant – the AI semiconductor market (for application-specific processors) is expected to reach more than $30 billion in 2023 and more than $50 billion in 2025, with the AI computing board and systems market estimated to be three to four times larger.

Figure 1 – Artificial Intelligence market opportunity. Source: Axelera AI.

80% of the current market is represented by chips that train the artificial neural networks typically used in cloud computing and large data centres owned by companies like Microsoft, Amazon, Google and Facebook. However, experts expect most of the market to shift to inference at the edge in the coming months.

This new generation of hardware for AI at the edge needs to address several challenges currently faced by developers.

Challenge 1: Standard computing performance is facing an end to its exponential growth.

As described by Moore’s law, Amdahl’s law and Dennard scaling, computer performance grew exponentially for 30 years. Looking carefully at data from the past 15 years, however, it is apparent that this growth has slowed down and almost flattened, especially in the last five years.

Challenge 2: Neural network size is increasing exponentially.  

While standard computing performance is slowing down, neural network sizes and computational requirements are increasing exponentially. In five years, the most advanced neural networks increased in size by over 1,000 times. Similarly, the computational requirements to train the most advanced networks are doubling every three months, which amounts to over 1,000 times every two and a half years.

Challenge 3: Computer technology is not optimized for AI workloads.   

The standard CPU (Central Processing Unit) design is not well suited to today’s data processing needs. Matrix-vector multiplications dominate AI workloads: around 70% of the work consists of multiplying large tables of numbers and accumulating the results of these multiplications.
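
As a concrete illustration of what “multiplying large tables of numbers and accumulating the results” means, a single fully connected layer boils down to one matrix-vector multiply plus a bias add (a minimal NumPy sketch with arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((1000, 4096))   # a "large table of numbers" (weights)
x = rng.standard_normal(4096)           # input activations
b = rng.standard_normal(1000)

# The dominant work of the layer: 1000 x 4096 multiplications, each
# accumulated into one of 1000 running sums (a matrix-vector multiply).
y = W @ x + b
print(y.shape, f"{W.size:,} multiply-accumulate operations")
```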

Challenge 4: Technology is inefficient, leading to a data bottleneck. 

Data movement is the key factor driving compute performance and power consumption in artificial intelligence, particularly in deep learning. AI workloads constantly move data from the computer’s memory to the CPU, where operations such as multiplications or sums are performed, and then back to the memory, where the partial or final result is stored. AI requires new technology that reduces data movement and optimizes data flow within the system.

Figure 2 – Challenges of Artificial Intelligence at the Edge. 

Properly addressing the above challenges and delivering new products based on modern computing architecture will unleash cutting-edge new applications and scenarios, including retail, security, smart cities and more. Here are a few examples of the areas AI at the edge has the potential to unlock.  

Mobility: The mobility market is one of the larger current markets for AI. This includes autonomous driving, driving assistant systems, driver attention control, fleet management, passenger counting, commercial payload and perimeter control. 

Retail: Retail automation is another one of the fastest-growing markets for AI at the edge. It impacts all shopping centres from supermarkets to local stores and vending machines. Typical applications in this area include interactive digital signage, customer analytics, product analytics, autonomous checkout systems and autonomous logistics.

Security: There are more than 500 million public and private cameras in the world. Most of these systems do not transfer video to the cloud. Instead, detection and crowd tracking are done by a computer in the camera’s proximity (in the case of shops and indoor areas) or within a private network.

Figure 3 – Edge AI Market opportunity. 

Smart City: According to the UN, 68% of the world’s population will live in urban areas by 2050. This unprecedented migration is forcing city and metropolitan area planners to rethink the way people live and how cities develop. AI is helping to collect insight and analyze data from cameras and sensors for applications such as intelligent traffic systems, intelligent lighting systems, intelligent parking systems and crowd analytics. 

Personal Safety: Artificial intelligence gives us the tools needed to improve safety at work and in private life. Camera systems can limit access to restricted areas to authorized personnel, limit access to specific devices or machines to authorized people (using biometrics for identification), promptly identify employees in danger and more. Augmented reality will also allow people to learn how to operate new tools more efficiently and safely.

Robotics & Drones: Artificial intelligence at the edge is powering drones and robots used across logistics, manufacturing and many other sectors. Drones can survey and help businesses operate efficiently and safely over large areas with challenging environmental conditions. These inventions will radically change several vital areas, including agriculture, environmental control and logistics.  

Manufacturing: Manufacturers have used computer vision to optimize their processes for decades. These systems operate in isolation to help limit the risk of a complete manufacturing line failure. Deep learning is continuously introducing new possibilities and helping achieve higher manufacturing standards and output.

Healthcare: Today, AI can help accurately identify early-stage skin cancers and other diseases with a success rate similar to that of an experienced specialist. Features like these currently rely on powerful cloud computers but will soon be available on edge devices outside the cloud.

The examples above illustrate only some of the numerous areas enhanced by AI and data-driven decision-making. The semiconductor market seems to have entered a Compute Cambrian Explosion era, with hundreds of newly founded fabless semiconductor startups proposing new solutions every day. It is challenging to determine which technology and which company will “win” this race, or whether only one winner will emerge.

We believe that heterogeneous architectures which merge different technologies will ultimately prevail. Dataflow computing and in-memory computing can deliver an optimal solution to fulfil market needs and provide cost-effective, robust and efficient hardware.

While traditional computing systems move data from memory to the computing unit and store the result back in memory, dataflow in-memory computing technology allows data to be processed directly inside the memory cell – drastically reducing data movement and, consequently, power consumption – and performs millions of operations in just one computing cycle.

Furthermore, combining computing and memory reduces the footprint of the chip and, consequently, its cost. Combining multiple in-memory computing cores with a dataflow architecture makes it possible to develop a versatile technology that supports the most widely used networks in computer vision and natural language processing, delivering high throughput and efficiency at a fraction of the cost of current solutions.

Interested in learning more about this topic? In our next article, our CTO will explore the nuanced world of in-memory computing. Subscribe here to follow our blog and receive email notifications when we post next.
