by Evangelos Eleftheriou – CTO of AXELERA AI
Technology is progressing at an incredible pace, and no field is moving faster than Artificial Intelligence (AI). Indeed, we are on the cusp of an AI revolution that is already reshaping our lives. AI technologies can be used to automate tasks or augment human capabilities, with applications including autonomous driving, advances in sensory perception and the acceleration of scientific discovery using machine learning. In the past five years, AI has become synonymous with Deep Learning (DL), another area seeing fast and dramatic progress. We are at a point where Deep Neural Networks (DNNs) for image and speech recognition can provide accuracy on par with, or even better than, that achieved by the human brain.
Most of the fundamental algorithmic developments around DL go back decades. However, the recent success has stemmed from the availability of large amounts of data and immense computing power for training neural networks. From around 2010, the exponential increase in single-precision floating-point operations offered by Graphics Processing Units (GPUs) ran in parallel with the explosion of neural network sizes and computational requirements. Specifically, the amount of compute used in the largest AI training runs has doubled every 3.5 months during the last decade. At the same time, the size of state-of-the-art models increased from 26M weights for ResNet-50 to 1.5B for GPT-2. This phenomenal increase in model size is reflected directly in the cost of training such complex models. For example, the cost of training the bidirectional transformer network BERT, for Natural Language Processing applications, is estimated at $61,000, whereas training XLNet, which outperformed BERT, cost about nine times as much. However, a major concern is not only the cost associated with the substantial energy consumption needed to train complex networks but also the significant environmental impact incurred in the form of CO2 emissions.
As the world looks to reduce carbon emissions, there is an even greater need for higher performance with lower power consumption. This is true not only for AI applications in the data center, but also at the Edge, which is where we expect the next revolution to take place. AI at the Edge refers to processing data where it is collected, as opposed to requiring data to be moved to separate processing centers. There is a wealth of applications at the Edge: AI for mobile devices, including authentication, speech recognition and mixed/augmented reality; AI for embedded processing in IoT devices, including smart cities and homes, as well as prosthetics, wearables and personalized healthcare; and AI for real-time video analytics for autonomous navigation and control. However, these embedded applications are all energy- and memory-constrained, meaning energy efficiency matters even more at the Edge. The end of Moore's and Dennard's scaling laws is compounding these challenges. Thus, there are compelling motivations to explore novel computing architectures that take inspiration from the most efficient computer on the planet: the human brain.
Traditional Computing Systems: Current State of Play
Traditional digital computing systems, based on the von Neumann architecture, consist of separate processing and memory units. Therefore, performing computations typically results in a significant amount of data being moved back and forth between the physically separated memory and processing units. This data movement costs latency and energy and creates an inherent performance bottleneck. The latency associated with the growing disparity between the speed of memory and processing units, commonly known as the memory wall, is one example of a crucial performance bottleneck for a variety of AI workloads. Similarly, the energy cost associated with shuttling data represents another key challenge for computing systems that are severely power limited due to cooling constraints, as well as for the plethora of battery-operated mobile devices. In general, the energy cost of multiplying two numbers is orders of magnitude lower than that of accessing numbers from memory. Therefore, it is clear to AI developers that there is a need to explore novel computing architectures that provide better collocation of processing and memory subsystems. One suggested concept in this area is near-memory computing, which aims to reduce the physical distance and time needed to access memory. This approach heavily leverages recent advances in die stacking and new technologies such as the hybrid memory cube (HMC) and high bandwidth memory (HBM).
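The imbalance can be made concrete with a little counting. The sketch below tallies the arithmetic operations and memory accesses of a naive matrix-vector multiply; the per-operation energies are order-of-magnitude placeholders of our own, not measured figures for any device.

```python
# Illustrative sketch: count arithmetic operations vs. memory accesses for a
# naive y = A @ x on a von Neumann machine. The energy numbers below are
# placeholder assumptions chosen only to show the order of the imbalance.

def mvm_access_count(rows, cols):
    """Return (arithmetic ops, memory accesses) for a naive matrix-vector multiply."""
    multiplies = rows * cols
    adds = rows * (cols - 1)
    # Every multiply-accumulate reads one matrix element and one vector
    # element from memory; each output element is written back once.
    reads = 2 * rows * cols
    writes = rows
    return multiplies + adds, reads + writes

ops, accesses = mvm_access_count(512, 512)

# Hypothetical 1 pJ per arithmetic op vs. 100 pJ per off-chip access:
compute_energy_pj = ops * 1.0
movement_energy_pj = accesses * 100.0
```

Since every multiply-accumulate drags two operands out of memory, the memory traffic is of the same order as the arithmetic, and with accesses costing far more energy per event, data movement dominates the energy budget.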
In-Memory Computing: A Radical New Approach
In-memory computing is a radically different approach to data processing, in which certain computational tasks are performed in place in the memory itself (Sebastian 2020). This is achieved by organizing the memory as a crossbar array and by exploiting the physical attributes of the memory devices. The peripheral circuitry and the control logic play a key role in creating what we call an in-memory computing (IMC) unit or computational memory unit (CMU). In addition to overcoming the latency and energy issues associated with data movement, in-memory computing has the potential to significantly improve the computational time complexity associated with certain computational tasks. This is primarily a result of the massive parallelism created by a dense array of millions of memory devices simultaneously performing computations.
For instance, crossbar arrays of such memory devices can be used to store a matrix and perform matrix-vector multiplications (MVMs) at constant O(1) time complexity without intermediate movement of data. Efficient matrix-vector multiplication via in-memory computing is very attractive for training and inference of deep neural networks, particularly for inference applications at the Edge, where high energy efficiency is critical. In fact, matrix-vector multiplications constitute 70-90% of all deep learning operations. Thus, applications requiring numerous AI components, such as computer vision, natural language processing, reasoning and autonomous driving, can exploit this technology in innovative ways. Novel dedicated hardware with massive on-chip memory, part of which is enhanced with in-memory computation capabilities, could lead to very efficient training and inference engines for ultra-large neural networks comprising potentially billions of synaptic weights.
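The physics behind the O(1) claim can be sketched in a few lines: weights become device conductances, inputs become row voltages, and Ohm's law (per device) plus Kirchhoff's current law (per column) perform all the multiply-accumulates in parallel. The mapping below is a simplified, unsigned toy model; real designs use techniques such as differential conductance pairs to represent signed weights.

```python
import numpy as np

# Toy model of analogue MVM in a crossbar: each weight is stored as a device
# conductance G_ij (siemens), the input vector is applied as row voltages v_i,
# and the current collected on column j is I_j = sum_i v_i * G_ij. All columns
# are read simultaneously, so the full MVM completes in one step: O(1) in time.

rng = np.random.default_rng(0)

def to_conductance(W, g_max=1e-4):
    """Map a non-negative weight matrix onto the conductance range [0, g_max]."""
    W = np.asarray(W, dtype=float)
    return W / np.abs(W).max() * g_max

def crossbar_mvm(G, v):
    """Column currents for row voltages v: one parallel analogue read."""
    return v @ G  # I_j = sum_i v_i * G_ij

W = rng.standard_normal((4, 3))
G = to_conductance(np.abs(W))        # unsigned toy example
v = rng.uniform(0.0, 0.2, size=4)    # read voltages
I = crossbar_mvm(G, v)
print(I.shape)  # (3,)
```

The crucial point is that the matrix never moves: it sits in the array as conductances, and only the small input and output vectors cross the array boundary.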
The core technology of IMC is memory. In general, there are two classes of memory devices. The conventional one, in which information is stored in the presence or absence of charge, includes dynamic random-access memory (DRAM), static random-access memory (SRAM) and Flash memory. There is also an emerging class of memory devices, in which information is stored in terms of the atomic arrangements within nanoscale volumes of materials, as opposed to charge on a capacitor. Generally speaking, one atomic configuration corresponds to one logic state and a different configuration to the other. These differences in atomic configuration manifest as a change in resistance, and thus these devices are collectively called resistive memory devices or memristors. Traditional and emerging memory technologies can perform a range of in-memory logic and arithmetic operations. Moreover, SRAM, Flash and all memristive memories can also be used for MVM operations.
The most important characteristics of a memory device are its read and write times, that is, how fast a device can store and retrieve information. Equally important characteristics are the cycling endurance, which refers to the number of times a memory device can be switched from one state to the other, the energy required to store information in a memory cell, and the size of the memory cell. Table 1 compares the traditional DRAM, SRAM and NOR Flash with the most popular emerging resistive-memory technologies, such as spin-transfer torque RAM (STT-RAM), phase-change memory (PCM) and resistive RAM (ReRAM).
Table 1 – Comparing different memory technologies. Sources: (B. Li 2019), (Marinella 2013)
Which Memory Technology for Which Operation? Considerations to Keep in Mind
There are many trade-offs involved in selecting which memory technology is suitable for MVM operations for the target DL workloads. For example, read latency, to a large extent, determines the performance of the system, also known as throughput, in operations per second (OPS). This means it also indirectly affects the system’s efficiency, measured in OPS/W. On the other hand, memory volatility, as well as the write time, determine whether the system supports static or reloadable weights. Cycling endurance is another important characteristic to keep in mind, as it determines the suitability of a memory technology for training and/or inference applications. For example, the limited endurance of PCM, ReRAM and Flash memory devices precludes them from DL training applications. The cell size also has an impact on the compute density. Specifically, it affects the die area and therefore the ASIC cost.
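As a rough illustration of how these characteristics interact, the sketch below turns a tile's read latency and read energy into throughput (OPS) and efficiency (OPS/W) figures. The 512×512 array size, 100 ns read latency and 1 nJ read energy are hypothetical placeholders for the arithmetic, not data for any real device.

```python
# Back-of-the-envelope sketch: how read latency and read energy of a single
# in-memory MVM tile translate into throughput (OPS) and efficiency (OPS/W).
# All input numbers below are illustrative assumptions.

def tile_metrics(rows, cols, read_latency_s, energy_per_read_j):
    """One crossbar read performs rows*cols MACs, i.e. 2*rows*cols OPS."""
    ops_per_read = 2 * rows * cols
    throughput_ops = ops_per_read / read_latency_s        # OPS
    power_w = energy_per_read_j / read_latency_s          # W
    efficiency = throughput_ops / power_w                 # OPS/W
    return throughput_ops, efficiency

# Hypothetical 512x512 tile, 100 ns per read, 1 nJ per read:
tput, eff = tile_metrics(512, 512, 100e-9, 1e-9)
```

Note that efficiency reduces to ops-per-read divided by energy-per-read, which is why the read energy, together with the array size, dominates the OPS/W figure, while the read latency sets the throughput.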
It is also important to look at temperature stability, drift phenomena and noise effects. In general, all memory devices exhibit intra-device variability and randomness that is intrinsic to how they operate. However, resistive memory devices appear to be more prone to noise (read and write), nonlinear behaviour, inter-device variability and inhomogeneity across an array. Thus, the precision achieved when using memristive technologies for analogue matrix-vector operations is typically not very high and requires the use of additional hardware-aware training techniques to achieve FP32-equivalent accuracies. Finally, the compatibility of the manufacturing process for memory devices with the CMOS technology and their scalability to lower lithography nodes are very important considerations for the successful commercialization of IMC technology and its future roadmap.
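One common hardware-aware training technique is to inject a model of the device noise into the forward pass, so that the learned weights tolerate the analogue imprecision they will meet at inference time. The sketch below uses a simple additive Gaussian model with a noise level relative to the weight range; real device models (drift, nonlinearity, inter-device variability) are considerably richer.

```python
import numpy as np

# Sketch of noise-injection training: perturb the weights on every forward
# pass with the statistics the analogue array is expected to exhibit.
# The additive Gaussian model and the 5% default are simplifying assumptions.

rng = np.random.default_rng(1)

def noisy_forward(W, x, rel_sigma=0.05):
    """Linear layer with per-read weight perturbation, as an analogue
    crossbar would see it; sigma is relative to the largest weight."""
    sigma = rel_sigma * np.abs(W).max()
    W_noisy = W + rng.normal(0.0, sigma, size=W.shape)
    return W_noisy @ x

W = rng.standard_normal((3, 4))
x = rng.standard_normal(4)
y = noisy_forward(W, x)
```

Training against such perturbations encourages the optimizer to find flat, noise-tolerant minima, which is how reported hardware-aware schemes recover accuracies close to FP32 baselines.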
SRAM has a unique advantage in that it exhibits the fastest read and write times and the highest endurance of all these memory devices. Thus, SRAM enables high-performance, reprogrammable IMC engines for both inference and training applications. Moreover, SRAM follows the scaling of CMOS technology to low lithography nodes and requires standard materials and processes that are readily available to foundries. On the other hand, it is a volatile memory technology that consumes energy even when idle, simply to retain data. In addition, SRAM's cell size, approximately 100 F², is the largest of all charge- and resistance-based memory technologies. However, volatility is not a serious drawback, as applications very rarely dictate static models. In fact, the fast write time of SRAM is a crucial advantage, allowing the system to switch between DL models through very fast re-programmability. Finally, from a system-architecture standpoint, this fast re-programmability means there is no need to map the entire DNN onto multiple crossbar arrays of memory devices, which would result in a large and costly ASIC.
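The re-programmability argument can be illustrated with a toy time-multiplexed design: a single crossbar runs a whole network layer by layer, rewriting its weights between layers instead of dedicating one physical array per layer. Everything here (the class, the sizes, the three-layer network) is a hypothetical construction for illustration.

```python
import numpy as np

# Toy time-multiplexing sketch: with a fast-write memory such as SRAM, one
# physical crossbar can execute several layers sequentially by reprogramming
# its weights between layers, instead of mapping every layer to its own array.

rng = np.random.default_rng(2)

class ReloadableCrossbar:
    def __init__(self, rows, cols):
        self.G = np.zeros((rows, cols))
        self.reprograms = 0

    def program(self, W):
        """Fast write: load a new weight matrix into the array."""
        self.G = np.array(W, dtype=float)
        self.reprograms += 1

    def mvm(self, v):
        return v @ self.G

layers = [rng.standard_normal((8, 8)) for _ in range(3)]  # a 3-layer toy net
xbar = ReloadableCrossbar(8, 8)
x = rng.standard_normal(8)
for W in layers:
    xbar.program(W)                  # swap in the next layer's weights
    x = np.maximum(xbar.mvm(x), 0)   # MVM followed by ReLU
print(xbar.reprograms)  # 3
```

The viability of this scheme hinges on the write time: with a slow-write or endurance-limited technology, the reprogramming step would dominate, which is exactly the trade-off the paragraph above describes.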
Recently, imec reported an SRAM-based IMC Multiply-Accumulate (MAC) unit with a record energy efficiency of 2,900 TOPS/W using ternary weights (imec 2020). There are also experimental prototype SRAM demonstrators that support INT8 activations and weights, whose latency, power consumption and area scale linearly with precision. Clearly, the in-memory MAC implementation and operation are only one part of a multi-faceted IMC-based system. Other digital units are needed to support element-wise vector-processing operations, including activation functions, depth-wise convolution, affine scaling, batch normalization and more. Moreover, the performance and usability of a multicore IMC engine also depend on multiple characteristics: an optimized memory hierarchy, a well-balanced fabric, a fine-tuned quantization flow, optimized weight-mapping strategies and a versatile compiler and software tool chain.
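One way such a linear precision-latency trade-off can arise (this is an illustration of the principle, not a description of the cited prototypes) is bit-serial operation: an n-bit activation is applied one bit per cycle, so halving the precision halves the number of read cycles.

```python
# Bit-serial dot product sketch (unsigned, for illustration): each cycle
# applies one bit-plane of the activations to the array and accumulates the
# shifted partial sum, so latency grows linearly with activation bit-width.

def bit_serial_dot(weights, activations, n_bits=8):
    """Compute sum_i w_i * a_i one activation bit per cycle, LSB first."""
    acc = 0
    cycles = 0
    for b in range(n_bits):
        bits = [(a >> b) & 1 for a in activations]          # bit-plane b
        partial = sum(w * bit for w, bit in zip(weights, bits))
        acc += partial << b                                 # shift-and-add
        cycles += 1
    return acc, cycles

w = [3, 1, 2]
a = [10, 200, 7]
result, cycles = bit_serial_dot(w, a)
# result == 3*10 + 1*200 + 2*7 == 244, after cycles == n_bits == 8 reads
```

Power and area scale for analogous reasons: fewer bits mean fewer read cycles and narrower accumulation logic.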
There have been many advancements in the computing sector, with even more to come. Our customers, and the industry as a whole, have made it clear that they want a system offering high throughput, high efficiency and high accuracy (the three highs), one that is also easy to use and, of course, cost-effective. At Axelera AI, we are working to design a system that offers all these capabilities and much more. Our AI solution will be based on a novel multicore in-memory computing paradigm combined with an innovative custom dataflow architecture.
B. Li, B. Yan, H. Li. 2019. “An Overview of In-memory Processing with Emerging Non-volatile Memory for Data-intensive Applications.” Great Lakes Symposium on VLSI.
imec. 2020. Imec and GLOBALFOUNDRIES Announce Breakthrough in AI Chip, Bringing Deep Neural Network Calculations to IoT Edge Devices. Jul. Accessed Nov 2021. https://www.imec-int.com/en/articles/imec-and-globalfoundries-announce-breakthrough-in-ai-chip-bringing-deep-neural-network-calculations-to-iot-edge-devices.
Marinella, M. 2013. “ERD Memory Planning – updated from last weeks telecon.”
Sebastian, A., Le Gallo, M., Khaddam-Aljameh, R., Eleftheriou, E. 2020. “Memory devices and applications for in-memory computing.” Nature Nanotechnology.