AGX Orin vs AGX Xavier: Jetson Ampere GPU Brings 8x AI Performance
NVIDIA Jetson AGX Xavier had a significant impact on the development of AI and deep learning applications at the edge. Jetson AGX Orin is the latest system-on-module from NVIDIA and is described as “the world’s most powerful AI computer for energy-efficient autonomous machines”. Much like the AGX Xavier, the AGX Orin is designed to operate autonomously at the edge, bringing advancements in both performance and efficiency.
The NVIDIA Jetson AGX Orin module shares the same compact form factor as the AGX Xavier, with a footprint of just 100mm x 87mm, and can either be integrated onto a carrier board or inside a box PC to create an embedded system. Specifically designed for industrial applications such as autonomous vehicles, smart manufacturing, robotics, healthcare, transportation, agriculture, aerospace, defence, and other harsh environments, our team at Things Embedded can recommend an embedded system based on the AGX Orin that delivers the form, fit, and function required for Edge AI applications.
This article evaluates the advancements Jetson AGX Orin brings to Artificial Intelligence at the Edge with our latest rugged Jetson computers, covering the onboard GPU, CPU, memory, storage, video codecs, and Programmable Vision Accelerator.
NVIDIA Jetson Ampere GPU: A Major Advancement in Edge AI
AGX Xavier and AGX Orin are built with two different GPU architectures. The Volta GPU architecture found on AGX Xavier and Ampere GPU architecture found on AGX Orin have some significant differences in terms of performance and capabilities.
The new Ampere GPU offers up to 2048 CUDA cores, 64 Tensor Cores, a maximum of 131 TOPS of INT8 Tensor compute, and 4.096 TFLOPS of FP32 CUDA compute. The new Tensor Float 32 (TF32) precision offers up to five times the training throughput of its predecessor, allowing AI and data science models to train more efficiently without any code changes. Both the 32GB and 64GB AGX Orin modules integrate the Jetson Ampere GPU.
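As a sanity check on those figures, FP32 throughput is simply CUDA cores × 2 FLOPs per fused multiply-add × clock rate. The quoted 4.096 TFLOPS implies a GPU clock of about 1.0 GHz; that clock is an inference from the article's numbers, not an official specification.

```python
# Back-of-the-envelope check of the quoted FP32 figure (a sketch; the
# ~1.0 GHz GPU clock is inferred from the article's numbers).
CUDA_CORES = 2048
OPS_PER_CORE_PER_CYCLE = 2      # one fused multiply-add counts as 2 FLOPs
GPU_CLOCK_HZ = 1.0e9            # assumed clock that matches 4.096 TFLOPS

tflops = CUDA_CORES * OPS_PER_CORE_PER_CYCLE * GPU_CLOCK_HZ / 1e12
print(f"FP32 throughput: {tflops:.3f} TFLOPS")  # 4.096
```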
3rd Generation Tensor Cores
A tensor is a mathematical object that arranges numerical values across multiple dimensions. Tensors are heavily utilized in deep learning models and are the fundamental structures of neural networks. The calculations done on tensors, such as matrix multiplication and convolution, are the basis of deep learning. Third-generation Tensor Cores on the AGX Orin Ampere GPU, together with the TensorRT inference SDK, are designed to accelerate deep learning workloads by performing matrix operations with high throughput and low latency. Some benefits of these technologies include:
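The matrix multiplication mentioned above reduces to repeated multiply-accumulate steps. A minimal pure-Python version makes the pattern concrete; it is illustrative only, since Tensor Cores execute this in hardware on whole matrix tiles at once:

```python
# Minimal pure-Python matrix multiply: the multiply-accumulate pattern
# that Tensor Cores accelerate in hardware on small matrix tiles.
def matmul(a, b):
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                out[i][j] += a[i][k] * b[k][j]  # one multiply-accumulate
    return out

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```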
Increased Performance: Third-generation Tensor Cores can significantly accelerate deep learning workloads, providing faster training and inference times. They perform matrix operations far faster than general-purpose CPU or GPU pipelines, which reduces the time and resources needed to train large neural networks.
Improved Energy Efficiency: Tensor Cores perform matrix operations with higher efficiency, reducing the energy consumption of deep learning workloads. This can lead to cost savings and improved environmental sustainability.
Support for Mixed Precision: Tensor Cores support mixed-precision arithmetic, which allows faster training and inference with lower-precision data types. This reduces the memory bandwidth required and can improve overall performance.
Scalability: Tensor Cores are designed to work in parallel, allowing performance to scale on large deep learning workloads. This enables the training of larger models and the processing of larger datasets.
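The mixed-precision point can be seen with nothing more than the Python standard library: `struct` can round-trip a value through IEEE-754 half precision (FP16), showing both the 2-byte footprint and the precision that is traded away. This is an illustrative sketch, not Jetson-specific code.

```python
import struct

def to_fp16_and_back(x):
    # Round-trip a Python float through IEEE-754 half precision ('e').
    return struct.unpack('<e', struct.pack('<e', x))[0]

value = 0.1
print(to_fp16_and_back(value))   # 0.0999755859375: FP16 keeps ~3 decimal digits
print(struct.calcsize('<e'), struct.calcsize('<f'))  # 2 vs 4 bytes per value
```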
Sparsity compute architecture refers to hardware designed to process sparse data efficiently: specialized units perform computations only on the non-zero values in a dataset, reducing the amount of work required. This can yield significant gains in performance and energy efficiency over architectures that ignore sparsity, and it is increasingly important in artificial intelligence, where training and inference on deep neural networks often involve large, sparse data. The Jetson Ampere GPU supports fine-grained structured sparsity, doubling Tensor Core throughput compared with the dense operations of the previous generation.
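Ampere's fine-grained structured sparsity follows a 2:4 pattern: in every group of four weights, at most two are non-zero, so hardware that skips the zeros can halve the multiply count. A hypothetical pure-Python sketch of the pruning step shows the idea (the function name and data are illustrative, not NVIDIA APIs):

```python
# Sketch of 2:4 fine-grained structured sparsity: in every group of four
# weights, keep the two largest magnitudes and zero the rest.
def prune_2_of_4(weights):
    pruned = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        keep = sorted(range(len(group)), key=lambda j: abs(group[j]))[-2:]
        pruned.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return pruned

w = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.8, 0.01]
sparse = prune_2_of_4(w)
print(sparse)                         # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.8, 0.0]
print(sum(v != 0.0 for v in sparse))  # 4 of 8 weights survive
```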
A Streaming Multiprocessor (SM) is a key component of NVIDIA GPUs responsible for executing parallel processing tasks. The SMs on Jetson GPUs execute CUDA threads, the basic unit of parallel processing on NVIDIA GPUs. Each SM contains multiple CUDA cores, which work together to execute CUDA threads in parallel, plus specialized hardware blocks, such as Tensor Cores and INT8 units, that are optimized for deep learning and computer vision workloads. The Ampere GPU onboard Jetson AGX Orin features a redesigned Streaming Multiprocessor that improves performance per watt and performance per area and incorporates third-generation Tensor Cores.
All New Arm Cortex-A78AE CPU: 1.7x Performance
Jetson AGX Orin incorporates a substantial shift in its CPU, replacing the NVIDIA Carmel CPU clusters with the Arm Cortex-A78AE. The new CPU is composed of 12 cores, delivering a 1.7x performance improvement over the 8-core NVIDIA Carmel CPU on Jetson AGX Xavier. Each of the twelve Cortex-A78AE cores is equipped with a 64KB L1 instruction cache, a 64KB L1 data cache, and an additional 256KB of L2 cache. As on the Jetson AGX Xavier, each cluster also includes a 2MB L3 cache. The Arm Cortex-A78AE CPU runs at up to 2GHz.
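Summing those per-core figures gives the aggregate on-chip cache. The grouping of the 12 cores into 3 clusters of 4 is an assumption consistent with the per-cluster L3 figure, stated in the code below:

```python
# Aggregate on-chip cache implied by the per-core figures above. The
# 3-clusters-of-4-cores layout is an assumption for a 12-core part.
CORES, CLUSTERS = 12, 3
L1_KB_PER_CORE = 64 + 64        # instruction cache + data cache
L2_KB_PER_CORE = 256
L3_MB_PER_CLUSTER = 2

l1_total_kb = CORES * L1_KB_PER_CORE          # 1536 KB
l2_total_kb = CORES * L2_KB_PER_CORE          # 3072 KB
l3_total_mb = CLUSTERS * L3_MB_PER_CLUSTER    # 6 MB
print(l1_total_kb, l2_total_kb, l3_total_mb)
```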
1.4x Memory Bandwidth and 2x eMMC Storage
The Jetson AGX Orin offers greater memory and storage capabilities than the Jetson AGX Xavier. The AGX Orin integrates either 32GB or 64GB of 256-bit LPDDR5 main memory, providing up to 204.8 GB/s of memory bandwidth. In comparison, the AGX Xavier delivers 136.5 GB/s from its onboard LPDDR4x memory. Both the Jetson AGX Orin 32GB and 64GB modules provide 64GB of onboard eMMC, twice the storage found on the range of AGX Xavier modules.
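The quoted bandwidth figures fall out of bus width times transfer rate. The 6400 MT/s (LPDDR5) and 4266 MT/s (LPDDR4x) rates used below are the ones implied by the article's numbers rather than values stated in it:

```python
# Memory bandwidth = bus width (bytes) x transfer rate.
BUS_WIDTH_BITS = 256
LPDDR5_MTS = 6400               # mega-transfers/s implied by 204.8 GB/s

bandwidth_gbs = BUS_WIDTH_BITS / 8 * LPDDR5_MTS / 1000
print(f"{bandwidth_gbs} GB/s")  # 204.8

# AGX Xavier comparison: 256-bit LPDDR4x at an assumed 4266 MT/s
xavier_gbs = 256 / 8 * 4266 / 1000
print(f"{xavier_gbs} GB/s")     # 136.512, matching the ~136.5 GB/s quoted
```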
Multi-Standard Video Encoding, Decoding, and JPEG Processing
Jetson AGX Orin contains a Multi-Standard Video Encoder, a Multi-Standard Video Decoder, and a JPEG processing block.
NVENC NVIDIA Encoder: NVENC is designed to accelerate video encoding for applications such as live streaming, video conferencing, and video editing. By using NVENC, the CPU is free to perform other tasks, while the NVIDIA GPU handles the encoding process. This can result in significant performance improvements and reduced power consumption. NVENC provides complete hardware acceleration for multiple compression formats, such as H.265, H.264, and AV1.
NVDEC NVIDIA Decoder: NVDEC is created to speed up the process of decoding videos. Utilizing NVDEC also removes the burden from the CPU and lets the NVIDIA GPU take on the decoding, allowing for increased performance and lowered power consumption. NVDEC can take advantage of specialized hardware to quickly and efficiently decode video streams using H.265, H.264, AV1, and VP9 standards.
NVJPG Processing Block: NVJPG is the dedicated JPEG hardware engine on Jetson modules, complementing nvJPEG, NVIDIA’s CUDA-based library that lets developers efficiently encode and decode JPEG images on NVIDIA GPUs. NVJPG performs operations defined by the JPEG still-image standard, such as compression and decompression, image scaling, decoding of specific color formats (YUV420, YUV422H/V, YUV444, YUV400), and RGB and YUV color space conversion.
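Those color formats differ in how heavily the chroma channels are subsampled, which directly sets the uncompressed frame size the JPEG engine must move. A quick illustrative calculation, assuming 8-bit samples:

```python
# Uncompressed frame size for the chroma formats listed above.
# Bytes per pixel at 8 bits: YUV444 = 3, YUV422 = 2, YUV420 = 1.5, YUV400 = 1.
BYTES_PER_PIXEL = {"YUV444": 3.0, "YUV422": 2.0, "YUV420": 1.5, "YUV400": 1.0}

def frame_bytes(width, height, fmt):
    return int(width * height * BYTES_PER_PIXEL[fmt])

for fmt in BYTES_PER_PIXEL:
    mb = frame_bytes(1920, 1080, fmt) / 1e6
    print(f"{fmt}: {mb:.3f} MB per 1080p frame")
```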
Next-generation Programmable Vision Accelerator Engine: PVA v2
Jetson Orin features version 2 of its Programmable Vision Accelerator, known as PVA v2. This dedicated hardware accelerator is specifically designed to accelerate computer vision workloads. PVA v2 in Jetson Orin offers significant improvements in performance and efficiency over the previous-generation PVA found onboard the AGX Xavier, including an upgraded vector processing design, enhanced memory bandwidth, and support for additional data formats.
With the PVA v2 in Jetson Orin, developers can accelerate a wide range of computer vision tasks, including object detection, segmentation, and classification, using a combination of hardware and software optimizations. This can help to improve the overall performance and efficiency of computer vision applications running on Jetson Orin, allowing for faster processing of image and video data.