The Metis AI Platform
A Technical Deep Dive

Evangelos Eleftheriou | CTO at AXELERA AI

The Metis AI Platform is a one-of-a-kind holistic hardware and software solution establishing best-in-class performance, efficiency, and ease of use for AI inferencing of computer vision workloads at the Edge. It encompasses the recently taped-out high-performance Metis AI Processing Unit (AIPU) chip, designed in 12nm CMOS, and the comprehensive Voyager Software Development Kit (SDK).

Axelera’s Metis AIPU

Axelera’s Metis AIPU is equipped with four homogeneous AI cores built for complete neural network inference acceleration. Each AI core is self-sufficient and can execute all layers of a standard neural network without external interactions. The four AI cores can collaborate on a single workload to boost throughput, run the same neural network in parallel to reduce latency, or concurrently process the different neural networks required by the application.

The AI core is a RISC-V-controlled dataflow engine delivering up to 53.5 TOPS of AI processing power. It features several high-throughput datapaths to provide balanced performance over a vast range of layers and to address the heterogeneous nature of modern neural network workloads. The total throughput of Axelera’s four-core Metis AIPU can reach 214 TOPS at a compute density of 6.65 TOPS/mm².

At the heart of each AI core is a massive in-memory-computing-based matrix-vector multiplier that accelerates matrix operations, and thereby convolutions, with an unprecedented energy efficiency of 15 TOPS/W. Matrix-vector multiplications constitute 70-90% of all deep learning operations. In-memory computing is a radically different approach to data processing: crossbar arrays of memory devices store a matrix and perform matrix-vector multiplications in constant O(1) time complexity without intermediate movement of data, making the operation extremely efficient in both latency and energy.
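As a conceptual illustration of the weight-stationary idea behind in-memory computing (a minimal sketch in plain NumPy, not a model of the Metis hardware), the matrix is written into the array once and only activation vectors and results move:

```python
import numpy as np

class Crossbar:
    """Toy weight-stationary crossbar: the weight matrix is written once
    and stays in place; only activations move in and results move out."""

    def __init__(self, weights: np.ndarray):
        # Writing the matrix corresponds to programming the memory array.
        self.weights = weights  # shape: (out_features, in_features)

    def mvm(self, activations: np.ndarray) -> np.ndarray:
        # All rows are evaluated at once, so the latency of one
        # matrix-vector product does not grow with the matrix height.
        return self.weights @ activations

# Example: a 512x512 layer programmed once, reused for every input vector.
rng = np.random.default_rng(0)
xbar = Crossbar(rng.standard_normal((512, 512)))
for _ in range(4):                       # stream of activation vectors
    y = xbar.mvm(rng.standard_normal(512))
print(y.shape)                           # (512,)
```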

Accuracy and noise immunity with D-IMC

Axelera AI has fundamentally changed the architecture of “compute-in-place” by introducing an SRAM-based digital in-memory computing (D-IMC) engine. In contrast to analog in-memory computing approaches, Axelera’s D-IMC design is immune to noise and memory non-idealities that affect the precision of the analog matrix-vector operations as well as the deterministic nature and repeatability of the matrix-vector multiplication results. Our D-IMC supports INT8 activations and weights, but the accumulation maintains full precision at INT32, which enables state-of-the-art FP32 iso-accuracy for a wide range of applications without the need for retraining.
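A minimal sketch of this quantization scheme (illustrative only, with assumed per-tensor scales rather than Axelera’s actual implementation): INT8 weights and activations are multiplied, the products are accumulated exactly in INT32, and only the final result is rescaled back to floating point:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor quantization of an FP32 tensor to INT8."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

# FP32 reference layer
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)
x = rng.standard_normal(256).astype(np.float32)

Wq, w_scale = quantize_int8(W)
xq, x_scale = quantize_int8(x)

# INT8 x INT8 products accumulated in INT32: the accumulation itself is
# exact, so the only error comes from rounding weights and activations.
acc_int32 = Wq.astype(np.int32) @ xq.astype(np.int32)
y_quant = acc_int32 * (w_scale * x_scale)   # dequantize once at the end

y_fp32 = W @ x
print(np.max(np.abs(y_quant - y_fp32)))     # small quantization error only
```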

The D-IMC engine of the matrix-vector multiplier is a handcrafted, full-custom design that interleaves the weight storage and the compute units in an extremely dense fashion. Beyond the energy saved by not moving weights, consumption is further reduced by a custom adder with minimized interconnect parasitics, by balanced delay paths that avoid energy-consuming glitches, and by judicious pipelining that provides high compute throughput at low supply voltage. Although the matrix-vector multiplier supports a large matrix size, activity and clock gating keep energy efficiency high even at low utilization. Note that the matrix coefficients can be written to the D-IMC engine in the background without stalling the computations.

In addition to the D-IMC-based matrix-vector multiplier, each AI core features a unit for block-sparse diagonal matrix operations, providing balanced performance for layers such as depth-wise convolution, pooling, and rescaling, which have a high IO-to-compute ratio compared to normal matrix-vector multiplications. Lastly, a stream vector unit addresses element-wise vector operations and other non-matrix-based operations, including activation function computations. This unit can operate on floating-point numbers to meet the increased numerical precision requirements of those functions.
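To see why such layers are IO-bound, a back-of-envelope comparison of multiply-accumulates per byte moved for a standard versus a depth-wise convolution (assumed toy layer dimensions, INT8 tensors) might look like this:

```python
# Back-of-envelope arithmetic intensity (MACs per byte of IO) for a
# standard vs. a depth-wise 3x3 convolution on an assumed 56x56x128 tensor.
H = W = 56; C = 128; K = 3          # toy layer dimensions (assumptions)

def io_bytes(c_in, c_out, weight_count):
    # INT8 tensors: input feature map + output feature map + weights
    return H * W * c_in + H * W * c_out + weight_count

# Standard convolution: every output channel sees every input channel.
std_macs  = H * W * C * C * K * K
std_bytes = io_bytes(C, C, C * C * K * K)

# Depth-wise convolution: one filter per channel, no cross-channel sums.
dw_macs  = H * W * C * K * K
dw_bytes = io_bytes(C, C, C * K * K)

print(f"standard  : {std_macs / std_bytes:6.1f} MACs/byte")
print(f"depth-wise: {dw_macs / dw_bytes:6.1f} MACs/byte")
# The depth-wise layer performs roughly 100x less compute per byte moved,
# so it is limited by IO rather than by the matrix-vector multiplier.
```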

Providing massive compute power is only one consideration. Having high-throughput, high-capacity memory close to the compute elements is equally important for overall performance and power efficiency. Besides 1 MiB of computational memory in the matrix-vector multiplier, which can be accessed at several tens of terabits per second, each core features 4 MiB of L1 memory that can be accessed by multiple streams concurrently with an aggregated bandwidth of multiple terabits per second. Together, these two memories provide 5 MiB of tightly coupled high-speed memory within a single AI core.

A fully integrated SoC

The four AI cores are integrated into a System-on-Chip (SoC) comprising a RISC-V control core, PCIe, LPDDR4x, an embedded Root of Trust, an at-speed crypto engine, and large on-chip SRAM, all connected via a high-speed, packetized Network-on-Chip (NoC). First, the application-class RISC-V control core, running a real-time operating system, is responsible for booting the chip, interfacing with external peripherals, and orchestrating collaboration between the AI cores. Second, the PCIe interface provides a high-speed link to an external host for offloading full neural network applications to the Metis AIPU. Finally, the NoC connects the AI cores to a multi-level shared memory hierarchy with 32 MiB of on-chip L2 SRAM and multiple GiB of LPDDR4x SDRAM, ultimately connecting more than 52 MiB of on-chip high-speed memory when the memories of the AI cores are included. The NoC splits control and data transfers and is further optimized to minimize contention when multiple data managers (AI cores, the RISC-V core, or the external host) simultaneously access the AI cores and the higher-level memories in the hierarchy. As such, it offers more than a terabit per second of aggregated bandwidth to the shared memories, ensuring the AI cores do not stall in highly congested multi-core scenarios.
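As a quick sanity check on the figures above (a back-of-envelope summary of the quoted numbers, not a hardware specification), the on-chip memory budget adds up per core and per SoC as follows:

```python
# Back-of-envelope on-chip memory budget from the figures quoted above.
IMC_MEM_MIB = 1      # computational memory inside the matrix-vector multiplier
L1_MEM_MIB  = 4      # L1 memory per AI core
NUM_CORES   = 4
L2_MEM_MIB  = 32     # shared on-chip L2 SRAM

per_core = IMC_MEM_MIB + L1_MEM_MIB              # 5 MiB per AI core
on_chip  = NUM_CORES * per_core + L2_MEM_MIB     # 52 MiB across the SoC
print(f"{per_core} MiB per core, {on_chip} MiB of on-chip high-speed memory")
```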

By pairing the massive compute capabilities of our D-IMC technology with an advanced memory subsystem and a flexible control scheme, the Metis AIPU can handle multiple demanding, complete neural network tasks in parallel with unparalleled energy efficiency.

Axelera’s Voyager SDK

The Voyager SDK provides an end-to-end integrated framework for Edge application development. It is built in-house with a focus on user experience, performance, and efficiency. In its first release, the SDK is optimized specifically for the development of computer vision applications for the Edge and enables developers to adopt the Metis AI platform for these use cases quickly and easily. Voyager guides users through the entire development process without requiring them to understand the internals of the Metis AIPU or to have expertise in deep learning: developers can start from turnkey pipelines for state-of-the-art models, customize these models to their particular application domain, deploy to Metis-enabled devices, and evaluate performance and accuracy with one-click simplicity.

As part of Axelera’s Metis AI platform, developers have access to the Axelera Model Zoo, available on the web and via cloud APIs. The Model Zoo offers state-of-the-art neural networks and turnkey application pipelines for a wide variety of use cases such as image classification, object detection, segmentation, key-point detection, and face recognition. Developers can also import their own pre-trained models with ease: Axelera’s toolchain automatically quantizes and compiles models trained in many different ML frameworks such as PyTorch and TensorFlow, and generates code that runs on the Metis AIPU with industry-leading performance and efficiency.
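As an illustration of what the import path can start from (the export step below is standard PyTorch; the downstream Voyager commands are not shown, and the use of ONNX here is an assumption for illustration, not the Voyager-specific interface), a pre-trained model is typically frozen into a portable graph before being handed to a vendor toolchain for quantization and compilation:

```python
import torch
import torchvision

# Load a pre-trained classification model and freeze it for inference.
model = torchvision.models.resnet50(weights="IMAGENET1K_V2")
model.eval()

# Export to ONNX, a common interchange format accepted by many vendor
# compilation toolchains (shown purely as an example of preparing a model).
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy_input,
    "resnet50.onnx",
    input_names=["images"],
    output_names=["logits"],
    dynamic_axes={"images": {0: "batch"}},  # allow variable batch size
    opset_version=17,
)
```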

Developers start their journey by downloading the SDK onto their development workstation and providing a high-level declarative definition of their application pipeline, either created from scratch or by modifying a preexisting template from the Model Zoo. In many cases, this is all Voyager needs to generate optimized code for a wide range of host platforms, leveraging optimized libraries for non-neural processing tasks such as those often found in neural network pre-processing and post-processing stages. The optimized code can be embedded directly into an inference server, which exposes the capabilities of Metis as a network service to clients and can be deployed to an Edge host over the network, either as a native binary or as a container. This client/server architecture makes it easy to construct end-to-end solutions for diverse application environments, from fully embedded processing of a MIPI CSI camera to distributed processing of multiple RTSP streams across networked devices.

Axelera’s software stack is built on industry-standard open-source frameworks and APIs, extended with advanced capabilities. Our machine learning compiler, a key element of the Voyager SDK, is built on the Apache TVM open-source compiler framework and implements Axelera’s industry-leading quantization technology: AI predictions on the Metis AIPU are practically indistinguishable from those produced by high-precision FP32 arithmetic. The SDK also builds on top of GStreamer, extending this popular open-source framework with optimized plugins and libraries for processing data with maximum efficiency. Expert users can take advantage of the versatile nature of our stack and its open APIs to further customize the code produced by the Voyager SDK for their particular use cases.
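To illustrate the kind of pipeline this enables (a hedged sketch: the element name "axinference" and its "model" property are hypothetical placeholders, not the actual Voyager plugin API), an RTSP stream can be decoded and routed through an inference element from Python using the standard GStreamer bindings:

```python
# Sketch of a GStreamer-based video inference pipeline using the standard
# Python bindings. The "axinference" element and its "model" property are
# hypothetical placeholders standing in for an accelerator plugin.
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)

pipeline = Gst.parse_launch(
    "uridecodebin uri=rtsp://camera.local/stream ! videoconvert ! "
    "axinference model=yolov5s ! fakesink"
)

pipeline.set_state(Gst.State.PLAYING)
try:
    GLib.MainLoop().run()          # process frames until interrupted
finally:
    pipeline.set_state(Gst.State.NULL)
```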