NVIDIA GPU Architecture: from Pascal to Turing to Ampere

Introduction

This paper focuses on the key improvements found when upgrading an NVIDIA® GPU from the Pascal to the Turing to the Ampere architectures, specifically from GP104 to TU104 to GA104. (The Volta architecture that preceded Turing is mentioned but is not a focus of this paper.)

NVIDIA GPUs have always excelled at video and graphics processing and at general-purpose data processing that benefits from massively parallel algorithms. In the update from Pascal to Volta/Turing, NVIDIA also became a leader in artificial intelligence (AI) processing with the inclusion of Tensor cores, which were first introduced in the Volta architecture for data centers in 2017 and then in the Turing architecture for desktop and other use cases in 2018. The Turing architecture also introduced Ray Tracing cores, which accelerate photorealistic rendering. With Ampere, NVIDIA has continued to make significant improvements to the GPU, including updates to the CUDA® core data paths and new generations of both Tensor cores and Ray Tracing cores.

Figure 1: NVIDIA Ampere GA104 architecture. Details for each SM are shown in Figure 2.

 

High-Level Components used in GPUs

The high-level components in the NVIDIA GPU architecture have remained the same from Pascal to Volta/Turing to Ampere:

  • PCIe Host Interface
  • GigaThread engine
  • Memory controllers
  • L2 Cache
  • Graphics Processing Clusters (GPCs)

 

Table 1: Component Blocks used in an NVIDIA GPU

 

| Component | Pascal GP104 | Turing TU104 | Ampere GA104 |
| --- | --- | --- | --- |
| PCIe Host Interface | Gen 3 | Gen 3 | Gen 4 |
| Memory type supported | GDDR5 | GDDR6 | GDDR6 |
| Memory Controllers | 8 x 32-bit (256 bits total) | 8 x 32-bit (256 bits total) | 8 x 32-bit (256 bits total) |
| Memory Bandwidth | 320 GB/s | 448 GB/s | 448 GB/s |
| L2 Cache Size | 2048 KB | 4096 KB | 4096 KB |
| Graphics Processing Clusters (GPCs)/GPU | 4 | 5 or 6 (SKU dependent) | 6 |

 

PCIe Host Interface: The Ampere GPU updates the PCIe host interface to PCIe 4.0 (Gen 4). This can provide double the bandwidth of Gen 3, and it remains backward compatible with previous PCIe generations.
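As a rough check of the "double the bandwidth" statement, the theoretical link bandwidth can be computed from the per-lane transfer rate. The sketch below assumes an x16 link and the 128b/130b line encoding used by PCIe 3.0 and 4.0. Neither the lane count nor the transfer rates appear in the text above, and the function name is illustrative.

```python
def pcie_bandwidth_gb_s(transfer_rate_gt_s: float, lanes: int = 16) -> float:
    """Theoretical one-direction bandwidth in GB/s.

    Assumes 128b/130b encoding (PCIe 3.0 and 4.0) and ignores protocol overhead.
    """
    encoding_efficiency = 128 / 130
    return transfer_rate_gt_s * lanes * encoding_efficiency / 8  # bits -> bytes


gen3 = pcie_bandwidth_gb_s(8.0)    # PCIe 3.0: 8 GT/s per lane
gen4 = pcie_bandwidth_gb_s(16.0)   # PCIe 4.0: 16 GT/s per lane

print(f"Gen 3 x16: {gen3:.2f} GB/s")
print(f"Gen 4 x16: {gen4:.2f} GB/s")
print(f"Gen 4 / Gen 3: {gen4 / gen3:.1f}x")
```

These are upper bounds for the link itself; achieved transfer rates depend on the host platform and the workload.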

Memory Support: The Pascal GPU supported GDDR5 memory. The Turing and Ampere GPUs support GDDR6 memory. GDDR6 delivers higher bandwidth over the same 256-bit interface (see Table 1) and is more energy efficient than GDDR5. It is also higher density, so more memory can be included in the same footprint.
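The bandwidth figures in Table 1 imply a per-pin data rate, since all three GPUs use the same 256-bit bus. The following sketch derives that rate from the table values only; the helper name is illustrative, and the resulting rates are computed here rather than quoted from NVIDIA.

```python
BUS_WIDTH_BITS = 256  # 8 memory controllers x 32 bits (Table 1)

# Memory bandwidth in GB/s, taken from Table 1.
BANDWIDTH_GB_S = {
    "Pascal GP104": 320,
    "Turing TU104": 448,
    "Ampere GA104": 448,
}


def per_pin_rate_gbps(bandwidth_gb_s: float, bus_width_bits: int = BUS_WIDTH_BITS) -> float:
    """Data rate per bus pin in Gbit/s implied by a total bandwidth in GB/s."""
    return bandwidth_gb_s * 8 / bus_width_bits


rates = {gpu: per_pin_rate_gbps(bw) for gpu, bw in BANDWIDTH_GB_S.items()}
for gpu, rate in rates.items():
    print(f"{gpu}: {rate:.1f} Gbit/s per pin")
```

With the bus width unchanged, the whole bandwidth increase from 320 GB/s to 448 GB/s comes from the higher per-pin rate.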

 

Components in Graphics Processing Clusters (GPCs)

Graphics processing clusters are the data processing engines of the GPU. Each GPC includes:

  • 1 Raster Engine
  • 2 Raster Operator (ROP) partitions, each containing 8 ROP units (Ampere only; in Pascal and Turing the ROPs are tied to the memory controllers and L2 cache)
  • Texture Processing Clusters (TPCs) which include:
    • PolyMorph Engine
    • Streaming Multiprocessors (SMs)
    • Since Volta/Turing: Tensor Core
    • Since Turing: Ray Tracing Core

 

Table 2: Component blocks in an NVIDIA Graphics Processing Cluster (GPC)

 

| Component | Pascal GP104 | Turing TU104 | Ampere GA104 |
| --- | --- | --- | --- |
| ROPs | 64 (tied to the memory controller and L2 cache) | 64 (tied to the memory controller and L2 cache) | 96 (integrated into GPC) |
| Texture Processing Clusters (TPCs)/GPC | 5 | 4 | 4 |
| TPC/GPU | 20 | 20 or 24 (SKU dependent) | 24 |
| Streaming Multiprocessors (SM)/TPC | 1 | 2 | 2 |
| Maximum SM/GPU (the actual number is SKU dependent) | 20 | 48 | 48 |

 

Raster Operator (ROP) Units: In the Pascal and Turing architectures, ROPs were tied to the memory controller and L2 cache. In the Ampere architecture, ROPs are integrated into each Graphics Processing Cluster (GPC). Including ROP partitions in the GPC helps to eliminate bottlenecks. There is also a higher overall number of ROP units in Ampere GPUs (96 in GA104 versus 64 in GP104 and TU104; see Table 2).
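The unit counts in Tables 1 and 2 follow from simple multiplication down the hierarchy (GPC, then TPC, then SM). The sketch below reproduces the maximum TPC, SM and Ampere ROP counts from the per-level figures; the class and field names are illustrative, and the full (6 GPC) TU104 configuration is assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuLayout:
    gpcs: int            # Graphics Processing Clusters per GPU (Table 1)
    tpcs_per_gpc: int    # Texture Processing Clusters per GPC (Table 2)
    sms_per_tpc: int     # Streaming Multiprocessors per TPC (Table 2)

    @property
    def tpcs(self) -> int:
        return self.gpcs * self.tpcs_per_gpc

    @property
    def sms(self) -> int:
        return self.tpcs * self.sms_per_tpc


LAYOUTS = {
    "Pascal GP104": GpuLayout(gpcs=4, tpcs_per_gpc=5, sms_per_tpc=1),
    "Turing TU104": GpuLayout(gpcs=6, tpcs_per_gpc=4, sms_per_tpc=2),
    "Ampere GA104": GpuLayout(gpcs=6, tpcs_per_gpc=4, sms_per_tpc=2),
}

# Ampere only: 2 ROP partitions per GPC, 8 ROP units per partition.
ampere_rops = LAYOUTS["Ampere GA104"].gpcs * 2 * 8

for name, layout in LAYOUTS.items():
    print(f"{name}: {layout.tpcs} TPCs, {layout.sms} SMs")
print(f"Ampere GA104 ROP units: {ampere_rops}")
```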

 

Other High-Level Architecture Changes

Manufacturing Process and Power Efficiency: Chips are manufactured using a process node, quoted in nm, that determines the size of the transistors on the chip. In general, the smaller the transistor, the faster it switches and the less power it uses at the same performance level.

Display and Video Engine: With each generation, support for higher resolution display output has increased. When an Ampere GPU is used with VESA Display Stream Compression (DSC) enabled, High Dynamic Range (HDR) rendering is also supported at those resolutions. Hardware accelerated encoding and decoding have also continued to improve, offloading the most computationally intensive tasks from the CPU to the GPU and providing real-time performance for high resolution encoding and decoding.

 

Table 3: Other High-Level Architecture Changes to NVIDIA GPUs

 

| Feature | Pascal GP104 | Turing TU104 | Ampere GA104 |
| --- | --- | --- | --- |
| Manufacturing Process | 16 nm | 12 nm | 8 nm |
| Transistors per GPU | 7.2 billion | 13.6 billion | 17.4 billion |
| TGP (Watts) | 180 | 215 - 230 | 220 |
| DisplayPort output | 1.2 certified (1.4 ready); 4K @ 60 Hz | 1.4a; 4K @ 240 Hz; 8K @ 60 Hz | 1.4a; 4K @ 240 Hz + HDR; 8K @ 60 Hz + HDR |
| HDMI output | 2.0b; 4K @ 60 Hz; 8K @ 30 Hz | 2.0b; 4K @ 60 Hz; 8K @ 30 Hz | 2.1; 4K @ 240 Hz + HDR; 8K @ 60 Hz + HDR |
| NVENC (hardware accelerated encode) | 4th Gen | 7th Gen; HEVC B-Frame support | 7th Gen |
| NVDEC (hardware accelerated decode) | 3rd Gen | 4th Gen | 5th Gen with AV1 |

 

Streaming Multiprocessor (SM) Architecture

Major improvements have been made to many of the components found in the Streaming Multiprocessors in each subsequent generation.

Figure 2: NVIDIA Streaming Multiprocessor architecture for Pascal, Turing, Ampere

 

Each Streaming Multiprocessor (SM) includes:

  • Four SM Processing Blocks (Partitions), and each includes:
    • CUDA data paths which can handle Floating Point (FP) or Integer (INT) calculations. The way the CUDA cores are assigned to perform a specific type of calculation has changed over the generations (see below for more info).
    • Tensor Core (Turing/Ampere)
    • Instruction cache per SM (Pascal) or L0 Instruction Cache per SM Block (Turing/Ampere)
    • Warp scheduler and Dispatch Unit. The way tasks are assigned has significantly improved over the generations to optimize core use (see below for more info).
    • Register File
    • Load/store units (LD/ST units)
    • Special function units (SFU) for transcendental math functions (e.g., log x, sin x, cos x, e^x)
  • L1 Data Cache/Shared Memory; this was consolidated starting with Turing
  • Texture Units
  • Ray Tracing Core (Turing/Ampere)
  • Two FP64 units (Turing/Ampere)

 

Table 4: Streaming Multiprocessor Changes

 

| Feature | Pascal GP104 | Turing TU104 | Ampere GA104 |
| --- | --- | --- | --- |
| CUDA Cores/SM (for FP32 / INT32) | 128 FP32 or INT32 | 64 FP32 and 64 INT32 | 64 FP32 only, plus 64 FP32 or INT32 |
| CUDA Cores/GPU (for FP32) | 2560 cores (20 SM, 128 cores/SM) | 3072 cores (48 SM, 64 cores/SM) | 3072 or 6144 FP cores (64 or 128 cores/SM) |
| SM Cores concurrent execution | Cores could be used for FP32 or INT32, no concurrent execution per partition | One FP32 partition, one INT32 partition, concurrent execution of FP and INT | One FP32 partition and one FP32 or INT32 partition, concurrent execution of FP and INT possible |
| Shared Memory/L1 Cache/SM | 64 KB Shared Mem (Texture/L1 separate) | 96 KB Shared Mem | 128 KB Shared Mem |
| Total Shared Memory/L1 Cache | 1280 KB | 4608 KB (48 SM x 96 KB per SM) | 6144 KB (48 SM x 128 KB per SM) |
| Memory handling | Separate instruction cache and per partition buffer; two L1 cache; shared memory | New L0 instruction cache per partition; combined L1/Shared Memory (as per Volta) | Similar structure as with Turing, but with larger memory |
| Warp Scheduler and Dispatch Unit | Warp scheduler + 2 dispatch units | Warp scheduler + dispatch unit; independent thread scheduling for sub-warp granularity (as per Volta) | Warp scheduler + dispatch unit (as per Volta/Turing) |
| Ray Tracing Cores | None | Gen 1, 1 RT core/SM | Gen 2, 1 RT core/SM (Gen 2 has 2x processing of Gen 1) |
| Tensor Cores | None | 320 of Gen 2 (Gen 1 released on Volta) | 184 of Gen 3 (Gen 3 has 2x processing of Gen 2) |
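The per-GPU totals in Table 4 are products of the per-SM figures and the maximum SM count. As a quick consistency check, the sketch below recomputes the FP32 core counts and total shared memory from the table's own values; the dictionary layout and function name are illustrative.

```python
# Per-SM figures from Table 4. Ampere lists two FP32 core counts per SM because
# its second data path can run either FP32 or INT32.
SPECS = {
    "Pascal GP104": {"sms": 20, "fp32_cores_per_sm": (128,), "shared_kb_per_sm": 64},
    "Turing TU104": {"sms": 48, "fp32_cores_per_sm": (64,), "shared_kb_per_sm": 96},
    "Ampere GA104": {"sms": 48, "fp32_cores_per_sm": (64, 128), "shared_kb_per_sm": 128},
}


def gpu_totals(spec: dict) -> tuple:
    """Return (FP32 core counts per GPU, total shared memory in KB)."""
    cores = tuple(spec["sms"] * per_sm for per_sm in spec["fp32_cores_per_sm"])
    shared_kb = spec["sms"] * spec["shared_kb_per_sm"]
    return cores, shared_kb


totals = {gpu: gpu_totals(spec) for gpu, spec in SPECS.items()}
for gpu, (cores, shared_kb) in totals.items():
    core_text = " or ".join(str(c) for c in cores)
    print(f"{gpu}: {core_text} FP32 cores, {shared_kb} KB shared memory")
```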

 

CUDA Datapath Changes

CUDA cores can be used for FP32 or for INT32 operations. With the Pascal architecture, SM partitions could be assigned either to FP32 or to INT32 operations, but they could not execute both simultaneously. With the Turing architecture, SM partitions separated the CUDA cores into two data paths, one dedicated to FP32 and the other dedicated to INT32. This allowed Turing SM partitions to execute FP32 and INT32 operations simultaneously. With the Ampere architecture, the two data paths of Turing are still present and one of them is still dedicated to FP32, but the other can now be used for either FP32 or INT32, depending on what is in demand.

Graphics workloads often require more FP32 calculations than INT32 calculations. In its Turing architecture whitepaper, NVIDIA estimated that in then-current games “for every 100 FP32 pipeline instructions there are about 35 additional instructions that run on the integer pipeline”, so approximately 26% (35 of 135) of the operations in those games are integer operations. (See NVIDIA TURING GPU ARCHITECTURE, page 66.) Given this unequal mix, a data path dedicated to INT32 is often idle. Because one of the Ampere SM data paths can be used for either FP32 or INT32 calculations, cores that would otherwise wait for INT work can be assigned FP work instead.
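To make the effect of the data path change concrete, the following is a deliberately simplified throughput model, not a description of the real scheduler. It assumes 128 CUDA cores per SM as in Table 4, perfect scheduling, and no memory or dependency stalls, and it compares single SMs only (SM counts and clock speeds are ignored). The function name and the model itself are assumptions made for illustration.

```python
LANES_PER_SM = 128  # CUDA cores per SM; two 64-lane data paths on Turing and Ampere


def cycles_per_sm(arch: str, fp_ops: float, int_ops: float) -> float:
    """Idealized cycles for one SM to retire a mix of FP32 and INT32 operations."""
    half = LANES_PER_SM / 2
    if arch == "pascal":
        # All lanes run FP32 or INT32, but never both at once.
        return (fp_ops + int_ops) / LANES_PER_SM
    if arch == "turing":
        # One dedicated FP32 path and one dedicated INT32 path, running concurrently.
        return max(fp_ops / half, int_ops / half)
    if arch == "ampere":
        # INT32 is limited to the flexible path; FP32 can use both paths.
        return max(int_ops / half, (fp_ops + int_ops) / LANES_PER_SM)
    raise ValueError(f"unknown architecture: {arch}")


fp_ops, int_ops = 100, 35  # instruction mix quoted from the Turing whitepaper
integer_share = int_ops / (fp_ops + int_ops)
turing = cycles_per_sm("turing", fp_ops, int_ops)
ampere = cycles_per_sm("ampere", fp_ops, int_ops)

print(f"Integer share of the mix: {integer_share:.1%}")
print(f"Turing: {turing:.3f} cycles, Ampere: {ampere:.3f} cycles")
print(f"Idealized per-SM speedup: {turing / ampere:.2f}x")
```

In this model the Turing INT32 path is busy for only about a third of the time the FP32 path needs, which is the idle capacity that the flexible Ampere data path reclaims. Real workloads will see smaller or larger gains depending on their instruction mix.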

 

Ray Tracing Cores Generation 2

The second generation Ray Tracing cores found in Ampere architecture GPUs can effectively deliver twice the performance of the first generation Ray Tracing cores found in Turing architecture GPUs. Ampere SMs also allow RT core and CUDA core compute workloads to run concurrently, introducing even more efficiencies. For users who need to render complex models with accurate shadows, reflections and refractions, or to render ray-traced motion blur, the Ampere RT cores will provide big performance improvements.

 

Tensor Cores Generation 3

The third generation Tensor cores found in Ampere GPUs provide much higher performance than the second generation Tensor cores found in Turing GPUs, and they add acceleration for more data types. The Volta Tensor cores supported FP16, the Turing Tensor cores introduced INT8, INT4 and binary 1-bit precisions, and the Ampere Tensor cores add support for the TF32 and BF16 data types. Depending on the type of workload, the third generation Tensor cores can deliver 2x to 4x more throughput than the previous generation.
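TF32 and BF16 both keep the 8-bit exponent of FP32 and shorten the 23-bit mantissa, to 10 bits for TF32 and 7 bits for BF16. The sketch below illustrates the resulting loss of precision by truncating the low mantissa bits of an FP32 value. Truncation (rounding toward zero) is a simplification chosen for clarity and is an assumption here; it is not a statement about how the hardware rounds, and the function names are illustrative.

```python
import struct


def truncate_mantissa(value: float, mantissa_bits: int) -> float:
    """Keep only the top `mantissa_bits` of the 23-bit FP32 mantissa (truncation)."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]  # FP32 bit pattern
    drop = 23 - mantissa_bits
    bits &= ~((1 << drop) - 1) & 0xFFFFFFFF                  # clear the dropped bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def to_tf32(value: float) -> float:
    return truncate_mantissa(value, 10)  # 1 sign + 8 exponent + 10 mantissa bits


def to_bf16(value: float) -> float:
    return truncate_mantissa(value, 7)   # 1 sign + 8 exponent + 7 mantissa bits


x = 3.14159265
print(f"FP32 input: {struct.unpack('>f', struct.pack('>f', x))[0]!r}")
print(f"as TF32:    {to_tf32(x)!r}")
print(f"as BF16:    {to_bf16(x)!r}")
```

Because the exponent width is unchanged, both formats cover the same numeric range as FP32; only the number of significant digits drops.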

Ampere Tensor cores also include a new Fine-Grained Structured Sparsity feature. It computes with only the subset of weights that have acquired a meaningful purpose during the learning process and skips the rest, which leads to more efficient inference acceleration.
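The structured pattern used by Ampere is 2:4, meaning at most two non-zero values in every group of four consecutive weights. The sketch below shows one simple way to prune a weight vector into that pattern by keeping the two largest magnitudes in each group. Magnitude-based selection is an assumption made for illustration, and the function name is hypothetical; it is not NVIDIA's pruning tool.

```python
def prune_2_of_4(weights: list) -> list:
    """Zero the two smallest-magnitude values in every group of four weights."""
    if len(weights) % 4 != 0:
        raise ValueError("weight count must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest magnitudes in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
sparse = prune_2_of_4(weights)
print(sparse)  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
```

Because the position of the zeros is constrained to a fixed pattern, the hardware can store only the non-zero values plus a small index and skip the zeroed multiplications.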

 

SM Memory Changes

The update from Pascal to Turing included an SM memory path redesign to unify shared memory, texture caching, and memory load caching into one unit. This provided two times more bandwidth and two times more capacity for L1 for common workloads. The amount of memory also increased from generation to generation.

 

Warp Scheduling Changes

In an NVIDIA GPU the basic unit for executing an instruction is the warp. A warp is a collection of threads that all share the same code and are all executed simultaneously by a Streaming Multiprocessor (SM). Multiple warps can be executed on an SM at once.
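As a small illustration of how threads map onto warps, the sketch below assumes the warp size of 32 threads used by these architectures (the figure is not stated in the text above). It computes how many warps a thread block occupies and which warp and lane a given thread falls into; the helper names are illustrative.

```python
WARP_SIZE = 32  # threads per warp (assumed; not stated in the text above)


def warps_per_block(threads_per_block: int) -> int:
    """Number of warps needed for a block; a partial warp still occupies a full one."""
    return -(-threads_per_block // WARP_SIZE)  # ceiling division


def warp_and_lane(thread_index: int) -> tuple:
    """(warp index within the block, lane index within the warp) for a linear thread index."""
    return divmod(thread_index, WARP_SIZE)


print(warps_per_block(256))  # 8
print(warps_per_block(100))  # 4 (the last warp has only 4 active threads)
print(warp_and_lane(70))     # (2, 6)
```

Choosing block sizes that are multiples of the warp size avoids partially filled warps, whose idle lanes still occupy scheduler slots.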

Pascal was designed to support many more active warps and threadblocks than previous architectures. Each warp scheduler was capable of dispatching two warp instructions per clock cycle.

Volta SM processing blocks each had a single warp scheduler and a single dispatch unit. This meant that each Volta processing block could issue only one instruction per clock cycle. However, Volta gained independent thread scheduling, with a program counter and call stack per thread and a schedule optimizer. Taken together, these allow threads to diverge at sub-warp granularity, which helps to ensure optimal usage of the cores.

Turing and Ampere inherited all of the Volta improvements to warp scheduling, resulting in significant processing optimization.

 

Software Tools

NVIDIA provides numerous software tools to help developers to accelerate GPU-based application development. With each new GPU generation new tools and new features are added.

 

CUDA Toolkit and CUDA Compute

The CUDA Toolkit includes GPU-accelerated libraries, a compiler, development tools and the CUDA runtime. Each major new architecture release is accompanied by a new version of the CUDA Toolkit, which includes tips for using existing code on newer architecture GPUs, as well as instructions for using new features only available when using the newer GPU architecture.

CUDA Compute capability allows developers to determine the features supported by a GPU. Ampere GPUs have a CUDA Compute Capability of 8.6, Turing GPUs 7.5, and Pascal GPUs 6.1.

For specific information the NVIDIA CUDA Toolkit Documentation provides tables that list the “Feature Support per Compute Capability” and the “Technical Specifications per Compute Capability”.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities
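Compute capability is a (major, minor) version pair, so feature checks reduce to a tuple comparison. The sketch below uses the values listed above to test for a minimum capability and to build the matching `nvcc -arch=sm_XY` option. The helper names are hypothetical, and the 7.0 threshold for Tensor cores reflects their introduction with Volta.

```python
# Compute capabilities as listed in this paper.
COMPUTE_CAPABILITY = {
    "Pascal": (6, 1),
    "Turing": (7, 5),
    "Ampere": (8, 6),
}

TENSOR_CORE_MIN = (7, 0)  # Tensor cores were introduced with Volta (7.0)


def supports(architecture: str, required: tuple) -> bool:
    """True if the architecture's compute capability is at least `required`."""
    return COMPUTE_CAPABILITY[architecture] >= required


def nvcc_arch_flag(architecture: str) -> str:
    """Build the nvcc option that targets this architecture, e.g. -arch=sm_86."""
    major, minor = COMPUTE_CAPABILITY[architecture]
    return f"-arch=sm_{major}{minor}"


for name in COMPUTE_CAPABILITY:
    print(name, nvcc_arch_flag(name), "Tensor cores:", supports(name, TENSOR_CORE_MIN))
```

At run time, a CUDA application would normally query the device itself rather than rely on a static table like this one.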

 

CUDA-X AI and CUDA-X HPC

CUDA-X is a collection of libraries, tools, and technologies built on top of CUDA specifically to support AI and HPC. These libraries work with NVIDIA GPUs which include Tensor cores.

NVIDIA also provides integrated support in a number of open source partner libraries, providing built-in GPU acceleration for numerous types of applications.

See: https://developer.nvidia.com/gpu-accelerated-libraries

See: https://developer.nvidia.com/hpc

 

Conclusion

With the release of each new GPU generation NVIDIA has continued to deliver huge increases in performance and revolutionary new features. Whether an application requires enhanced image quality or powerful compute and AI acceleration, upgrading to the latest NVIDIA Ampere architecture will provide significant performance improvements.

 

Trademarks

NVIDIA, the NVIDIA logo, and CUDA are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated. All other trademarks are property of their respective owners.

 

 

 

 

 
