NVIDIA 16nm Pascal Based Tesla P100 With GP100 GPU Unveiled – World's First GPU With HBM2 and 10.6 …

NVIDIA has officially unveiled the Pascal based Tesla P100, its fastest GPU to date. The Pascal GP100 chip is NVIDIA's first GPU built on TSMC's 16nm FinFET process node, which TSMC rates at 65 percent higher speed, around 2 times the transistor density, or 70 percent less power than its 28HPM technology. The new FinFET process allows NVIDIA to deliver up to 2 times the performance per watt on Pascal compared to Maxwell GPUs.

NVIDIA Pascal Tesla P100 Unveiled - 15.3 Billion Transistors on a 610mm2 16nm Die - 16 GB HBM2 Memory With Insane Compute

The NVIDIA Pascal Tesla P100 GPU brings back high double precision compute on NVIDIA chips, which was not featured on the Maxwell generation of cards. The Maxwell generation put NVIDIA in a very competitive position, with a lineup of graphics cards that led not only in performance per watt but also in value for money. NVIDIA has developed a large ecosystem around its Maxwell cards, which are sold under the GeForce brand.

With Pascal, NVIDIA is targeting not only the GeForce brand but also the high-performance Tesla market, which is where its biggest chips are aimed. NVIDIA has seen huge demand for next-generation chips in this market and has prepared a range of next-gen chips specifically for HPC.

The GP100 GPU used in Tesla P100 incorporates multiple revolutionary new features and unprecedented performance. Key features of Tesla P100 include:

Extreme performance—powering HPC, deep learning, and many more GPU Computing areas;
NVLink—NVIDIA’s new high speed, high bandwidth interconnect for maximum application scalability;
HBM2—Fastest, high capacity, extremely efficient stacked GPU memory architecture;
Unified Memory and Compute Preemption—significantly improved programming model;
16nm FinFET—enables more features, higher performance, and improved power efficiency.

The current 28nm products have existed in the Tesla market since early 2012. This was when NVIDIA started shipping GK110 GPUs to build the Titan supercomputer. The Tesla K20X powered what was the fastest supercomputer in the world at that time. When Maxwell came to market, NVIDIA was still selling the bulk of its Kepler parts for their high double precision compute, something that was missing on Tesla Maxwell cards. While NVIDIA did later launch Maxwell based Tesla cards aimed at the Cloud / Virtualization sectors, its top FP64 crunching Tesla cards are returning with the new Tesla Pascal graphics cards.

Pascal GPU Roadmap Slides From GTC 2015 Showcasing The Architecture Updates on The Latest GPU.

The new Pascal GP100 GPU that is aimed at the Tesla market first features three key technologies, NVLINK, FP16 and HBM2. Those go along well with the architectural improvements in NVIDIA's latest CUDA architecture.

NVIDIA Pascal GP100 With 10.6 TFLOPs Single and 5.3 TFLOPs Dual Precision Compute On A Single Graphics Card

NVIDIA Pascal GP100 GPU Architecture - The Building Blocks of NVIDIA's HPC Accelerator Chip - 3840 CUDA Cores, Preemption and Return of Double Precision With a Bang

Like previous Tesla GPUs, GP100 is composed of an array of Graphics Processing Clusters (GPCs), Streaming Multiprocessors (SMs), and memory controllers. GP100 achieves its colossal throughput by providing six GPCs, up to 60 SMs, and eight 512-bit memory controllers (4096 bits total). The Pascal architecture’s computational prowess is more than just brute force: it increases performance not only by adding more SMs than previous GPUs, but by making each SM more efficient. Each SM has 64 CUDA cores and four texture units, for a total of 3840 CUDA cores and 240 texture units.
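
The totals quoted above follow directly from the per-SM and per-controller figures. As a quick sanity check, here is a minimal Python sketch that multiplies them out. The constant and function names are ours, and the even split of SMs across GPCs is an assumption for illustration, since the article only gives the totals.

```python
# Full-die GP100 figures as quoted in the article.
GPCS = 6
SMS_TOTAL = 60
CUDA_CORES_PER_SM = 64
TEXTURE_UNITS_PER_SM = 4
MEMORY_CONTROLLERS = 8
CONTROLLER_WIDTH_BITS = 512


def gp100_totals():
    """Multiply the per-unit figures out to whole-chip totals."""
    return {
        # Assumption: SMs are spread evenly across the GPCs.
        "sms_per_gpc": SMS_TOTAL // GPCS,
        "cuda_cores": SMS_TOTAL * CUDA_CORES_PER_SM,
        "texture_units": SMS_TOTAL * TEXTURE_UNITS_PER_SM,
        "memory_bus_bits": MEMORY_CONTROLLERS * CONTROLLER_WIDTH_BITS,
    }


if __name__ == "__main__":
    for name, value in gp100_totals().items():
        print(f"{name}: {value}")
```

The results line up with the 3840 CUDA cores, 240 texture units and 4096-bit memory interface stated for the full chip.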

Pascal GP100 Has Insane Clock Speeds - Near 1.5 GHz Boost Clocks

The Pascal GP100 runs at a 1328 MHz core clock and a 1480 MHz boost clock. That is a big leap over previous Tesla parts and suggests clock speeds will scale even higher on the smaller chips, so we can expect to see Pascal GPUs running at 1500 MHz and above on the consumer market.

GP100’s SM incorporates 64 single-precision (FP32) CUDA Cores. In contrast, the Maxwell and Kepler SMs had 128 and 192 FP32 CUDA Cores, respectively. The GP100 SM is partitioned into two processing blocks, each having 32 single-precision CUDA Cores, an instruction buffer, a warp scheduler, and two dispatch units. While a GP100 SM has half the total number of CUDA Cores of a Maxwell SM, it maintains the same register file size and supports similar occupancy of warps and thread blocks.

NVIDIA Volta Tesla V100S Specs:

NVIDIA Tesla Graphics Card	Tesla K40 (PCI-Express)	Tesla M40 (PCI-Express)	Tesla P100 (PCI-Express)	Tesla P100 (SXM2)	Tesla V100 (PCI-Express)	Tesla V100 (SXM2)	Tesla V100S (PCIe)
GPU	GK110 (Kepler)	GM200 (Maxwell)	GP100 (Pascal)	GP100 (Pascal)	GV100 (Volta)	GV100 (Volta)	GV100 (Volta)
Process Node	28nm	28nm	16nm	16nm	12nm	12nm	12nm
Transistors	7.1 Billion	8 Billion	15.3 Billion	15.3 Billion	21.1 Billion	21.1 Billion	21.1 Billion
GPU Die Size	551 mm2	601 mm2	610 mm2	610 mm2	815 mm2	815 mm2	815 mm2
SMs	15	24	56	56	80	80	80
TPCs	15	24	28	28	40	40	40
CUDA Cores Per SM	192	128	64	64	64	64	64
CUDA Cores (Total)	2880	3072	3584	3584	5120	5120	5120
Texture Units	240	192	224	224	320	320	320
FP64 CUDA Cores / SM	64	4	32	32	32	32	32
FP64 CUDA Cores / GPU	960	96	1792	1792	2560	2560	2560
Base Clock	745 MHz	948 MHz	1190 MHz	1328 MHz	1230 MHz	1297 MHz	TBD
Boost Clock	875 MHz	1114 MHz	1329 MHz	1480 MHz	1380 MHz	1530 MHz	1601 MHz
FP16 Compute	N/A	N/A	18.7 TFLOPs	21.2 TFLOPs	28.0 TFLOPs	30.4 TFLOPs	32.8 TFLOPs
FP32 Compute	5.04 TFLOPs	6.8 TFLOPs	10.0 TFLOPs	10.6 TFLOPs	14.0 TFLOPs	15.7 TFLOPs	16.4 TFLOPs
FP64 Compute	1.68 TFLOPs	0.2 TFLOPs	4.7 TFLOPs	5.30 TFLOPs	7.0 TFLOPs	7.80 TFLOPs	8.2 TFLOPs
Memory Interface	384-bit GDDR5	384-bit GDDR5	4096-bit HBM2	4096-bit HBM2	4096-bit HBM2	4096-bit HBM2	4096-bit HBM2
Memory Size	12 GB GDDR5 @ 288 GB/s	24 GB GDDR5 @ 288 GB/s	16 GB HBM2 @ 732 GB/s / 12 GB HBM2 @ 549 GB/s	16 GB HBM2 @ 732 GB/s	16 GB HBM2 @ 900 GB/s	16 GB HBM2 @ 900 GB/s	16 GB HBM2 @ 1134 GB/s
L2 Cache Size	1536 KB	3072 KB	4096 KB	4096 KB	6144 KB	6144 KB	6144 KB
TDP	235W	250W	250W	300W	250W	300W	250W

GP100’s SM has the same number of registers as Maxwell GM200 and Kepler GK110 SMs, but the entire GP100 GPU has far more SMs, and thus many more registers overall. This means threads across the GPU have access to more registers, and GP100 supports more threads, warps, and thread blocks in flight compared to prior GPU generations.
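
To put rough numbers on "many more registers overall", the sketch below scales a fixed per-SM register file by the SM counts from the spec table. The per-SM figure of 65,536 32-bit registers is not given in this article. It is our assumption for all three chips, so treat the totals as illustrative.

```python
# Assumption (not stated in the article): each SM on GK110, GM200 and
# GP100 has 65,536 32-bit registers, i.e. a 256 KiB register file.
REGISTERS_PER_SM = 65_536
BYTES_PER_REGISTER = 4

# SM counts taken from the Tesla spec table.
SM_COUNTS = {
    "GK110 (Tesla K40)": 15,
    "GM200 (Tesla M40)": 24,
    "GP100 (Tesla P100)": 56,
}


def register_file_kib(sm_count: int) -> int:
    """Total register file size across the GPU, in KiB."""
    return sm_count * REGISTERS_PER_SM * BYTES_PER_REGISTER // 1024


if __name__ == "__main__":
    for gpu, sms in SM_COUNTS.items():
        print(f"{gpu}: {register_file_kib(sms)} KiB of registers")
```

Under that assumption the Tesla P100 carries a little over twice the total register capacity of the Tesla M40, purely because of the higher SM count.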

Overall shared memory across the GP100 GPU is also increased due to the increased SM count, and aggregate shared memory bandwidth is effectively more than doubled. A higher ratio of shared memory, registers, and warps per SM in GP100 allows the SM to more efficiently execute code. There are more warps for the instruction scheduler to choose from, more loads to initiate, and more bandwidth to shared memory per thread.

On the compute side, Pascal takes the next step with double precision performance rated at 5.3 TFLOPs, more than double what was offered on the last generation of FP64 enabled GPUs. As for single precision performance, Pascal breaks past the 10 TFLOPs barrier with ease. The chip comes with 4 MB of L2 cache. The GPU is in volume production and will be arriving in HPC markets very soon. For mixed precision workloads, the Tesla P100 can achieve a maximum of 21 TFLOPs of FP16 compute performance, which is twice its FP32 throughput at half the precision.
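
The headline compute figures can be reproduced from the core counts and boost clock in the spec table. The sketch below assumes the usual convention of counting one fused multiply-add as two floating point operations per core per clock, and it assumes FP16 runs at twice the FP32 rate. Neither assumption is spelled out in the article, and the function name is ours.

```python
def peak_tflops(units: int, clock_mhz: float, flops_per_unit_per_clock: int = 2) -> float:
    """Theoretical peak throughput in TFLOPs.

    Assumes each unit retires one fused multiply-add (2 FLOPs) per clock.
    """
    return units * flops_per_unit_per_clock * clock_mhz * 1e6 / 1e12


# Tesla P100 (SXM2) figures from the spec table.
P100_FP32_CORES = 3584
P100_FP64_UNITS = 1792
P100_BOOST_MHZ = 1480

fp32 = peak_tflops(P100_FP32_CORES, P100_BOOST_MHZ)
fp64 = peak_tflops(P100_FP64_UNITS, P100_BOOST_MHZ)
fp16 = 2 * fp32  # assumption: FP16 runs at twice the FP32 rate

if __name__ == "__main__":
    print(f"FP32: {fp32:.1f} TFLOPs")
    print(f"FP64: {fp64:.1f} TFLOPs")
    print(f"FP16: {fp16:.1f} TFLOPs")
```

These are theoretical peaks at the boost clock. They round to the 10.6, 5.3 and 21.2 TFLOPs listed for the SXM2 card.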

Because of the importance of high-precision computation for technical computing and HPC codes, a key design goal for Tesla P100 is high double-precision performance. Each GP100 SM has 32 FP64 units, providing a 2:1 ratio of single- to double-precision throughput. Compared to the 3:1 ratio in Kepler GK110 GPUs, this allows Tesla P100 to process FP64 workloads more efficiently.
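
The ratios mentioned here fall straight out of the per-SM unit counts in the spec table. A small sketch, with names of our own choosing, that also shows how far Maxwell GM200 sat from the other two:

```python
from fractions import Fraction

# (FP32 CUDA cores per SM, FP64 CUDA cores per SM, SMs on the Tesla card),
# all taken from the spec table.
PER_SM = {
    "Kepler GK110": (192, 64, 15),
    "Maxwell GM200": (128, 4, 24),
    "Pascal GP100": (64, 32, 56),
}


def fp32_to_fp64_ratio(fp32_units: int, fp64_units: int) -> str:
    """Reduce the unit counts to a simple N:M ratio string."""
    ratio = Fraction(fp32_units, fp64_units)
    return f"{ratio.numerator}:{ratio.denominator}"


def fp64_units_per_gpu(fp64_units_per_sm: int, sm_count: int) -> int:
    """Total FP64 units on the card."""
    return fp64_units_per_sm * sm_count


if __name__ == "__main__":
    for gpu, (fp32_units, fp64_units, sms) in PER_SM.items():
        print(
            f"{gpu}: {fp32_to_fp64_ratio(fp32_units, fp64_units)} FP32:FP64, "
            f"{fp64_units_per_gpu(fp64_units, sms)} FP64 units per GPU"
        )
```

The per-GPU totals agree with the 960, 96 and 1792 FP64 units listed in the table.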

NVIDIA Pascal is Built on TSMC's 16nm FinFET Process Node

The chip is based on the 16nm FinFET process, which brings efficiency improvements and better performance per watt, and with Pascal, double precision compute returns with a bang. Maxwell, NVIDIA's current gen architecture, made some serious gains in the performance per watt department, and Pascal is expected to carry that tradition forward.

TSMC’s 16FF+ (FinFET Plus) technology can provide above 65 percent higher speed, around 2 times the density, or 70 percent less power than its 28HPM technology. Comparing with 20SoC technology, 16FF+ provides extra 40% higher speed and 60% power saving. By leveraging the experience of 20SoC technology, TSMC 16FF+ shares the same metal backend process in order to quickly improve yield and demonstrate process maturity for time-to-market value. via TSMC

GPU Architecture	NVIDIA Fermi	NVIDIA Kepler	NVIDIA Maxwell	NVIDIA Pascal
GPU Process	40nm	28nm	28nm	16nm (TSMC FinFET)
Flagship Chip	GF110	GK210	GM200	GP100
GPU Design	SM (Streaming Multiprocessor)	SMX (Streaming Multiprocessor)	SMM (Streaming Multiprocessor Maxwell)	SMP (Streaming Multiprocessor Pascal)
Maximum Transistors	3.00 Billion	7.08 Billion	8.00 Billion	15.3 Billion
Maximum Die Size	520mm2	561mm2	601mm2	610mm2
Stream Processors Per Compute Unit	32 SPs	192 SPs	128 SPs	64 SPs
Maximum CUDA Cores	512 CCs (16 CUs)	2880 CCs (15 CUs)	3072 CCs (24 CUs)	3840 CCs (60 CUs)
FP32 Compute	1.33 TFLOPs(Tesla)	5.10 TFLOPs (Tesla)	6.10 TFLOPs (Tesla)	~12 TFLOPs (Tesla)
FP64 Compute	0.66 TFLOPs (Tesla)	1.43 TFLOPs (Tesla)	0.20 TFLOPs (Tesla)	~6 TFLOPs(Tesla)
Maximum VRAM	1.5 GB GDDR5	6 GB GDDR5	12 GB GDDR5	16 / 32 GB HBM2
Maximum Bandwidth	192 GB/s	336 GB/s	336 GB/s	720 GB/s - 1 TB/s
Maximum TDP	244W	250W	250W	300W
Launch Year	2010 (GTX 580)	2014 (GTX Titan Black)	2015 (GTX Titan X)	2016

NVIDIA Pascal GP100 Is The First Single Chip GPU With HBM2 To Achieve 1 TB/s Bandwidth

Under the Tesla brand, NVIDIA will be introducing a range of HPC cards based on the GP100 GPU core, which utilizes the Pascal architecture and delivers a behemoth 5.3 TFLOPs of double precision compute along with 16 GB of HBM2 VRAM. HBM2 clocked at 2 Gbps per pin can deliver up to 1 TB/s of bandwidth over the 4096-bit interface, although the Tesla P100 itself is listed at 732 GB/s in the spec table above. This makes Pascal GP100 the first single GPU with a memory interface capable of reaching 1 TB/s, which is a remarkable feat in itself. GP100 is also the first graphics card in the world to feature the next-gen memory standard, HBM2 from Samsung.
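
The 1 TB/s figure is simple arithmetic on bus width and per-pin data rate. The sketch below works it in both directions: from the 2 Gbps quoted here to a bandwidth, and from the 732 GB/s listed for the Tesla P100 in the spec table back to an implied pin rate. The function names are ours, and the implied pin rate is our own back-calculation, not a figure from NVIDIA.

```python
def bandwidth_gb_s(bus_width_bits: int, gbps_per_pin: float) -> float:
    """Peak memory bandwidth in GB/s: one bit per pin per transfer, 8 bits per byte."""
    return bus_width_bits * gbps_per_pin / 8


def implied_pin_rate_gbps(bus_width_bits: int, bandwidth: float) -> float:
    """Per-pin data rate in Gbps needed to reach a given bandwidth in GB/s."""
    return bandwidth * 8 / bus_width_bits


if __name__ == "__main__":
    # HBM2 at the 2 Gbps quoted in the article, on a 4096-bit bus.
    print(f"4096-bit @ 2.0 Gbps: {bandwidth_gb_s(4096, 2.0):.0f} GB/s")
    # Pin rate implied by the 732 GB/s listed for the Tesla P100.
    print(f"732 GB/s on 4096-bit: {implied_pin_rate_gbps(4096, 732):.2f} Gbps per pin")
    # Same check for the Tesla K40: 384-bit GDDR5 at 288 GB/s.
    print(f"288 GB/s on 384-bit: {implied_pin_rate_gbps(384, 288):.1f} Gbps per pin")
```

By this arithmetic the 732 GB/s figure corresponds to HBM2 running noticeably below 2 Gbps per pin, while the K40 result matches the 6 GHz effective memory speed listed in the comparison table at the end of the article.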

The HBM2 VRAM has a lot of advantages in the graphics sector. Not only is it faster, it also scales down and up across several different SKUs. HBM2 has a much wider bus than GDDR5 memory, it offers up to 1 TB/s of bandwidth and, last but not least, it allows HPC class graphics cards to feature up to 16 GB of VRAM.

The next generation of NVIDIA Tesla GPUs, which will be shipping to HPC users this year, is already equipped with HBM2 VRAM. NVIDIA is the first graphics card company to feature HBM2 on its GPUs, with the competition a whole year away from launching its own HBM2 powered chips.

NVIDIA GP100 is a 12 TFLOPs GPU, Full Fat SKU Yet To Arrive With 32 GB HBM2

One of the surprising things about today's announcement is that the Tesla P100 isn't based on the full fat GP100 GPU but on a cut down version with 3584 CUDA Cores. The full chip is a behemoth in terms of design, featuring up to 3840 CUDA Cores and 32 GB of HBM2 memory. It's possible that we will see a standard graphics board design later in the roadmap that achieves the full 12 TFLOPs of processing power on the new GP100 graphics processing unit.
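
It is worth checking what a 12 TFLOPs full-die part would require. The sketch below assumes two FLOPs per CUDA core per clock (one fused multiply-add), which the article does not state, and uses the 3840 core count and the 1480 MHz boost clock quoted above. The function names are ours.

```python
FLOPS_PER_CORE_PER_CLOCK = 2  # assumption: one fused multiply-add per clock


def peak_tflops(cores: int, clock_mhz: float) -> float:
    """Theoretical peak FP32 throughput in TFLOPs."""
    return cores * FLOPS_PER_CORE_PER_CLOCK * clock_mhz * 1e6 / 1e12


def required_clock_mhz(target_tflops: float, cores: int) -> float:
    """Clock needed to hit a target FP32 throughput with a given core count."""
    return target_tflops * 1e12 / (cores * FLOPS_PER_CORE_PER_CLOCK) / 1e6


if __name__ == "__main__":
    full_die_cores = 3840
    print(f"Full die at 1480 MHz: {peak_tflops(full_die_cores, 1480):.2f} TFLOPs")
    print(f"Clock for 12 TFLOPs: {required_clock_mhz(12.0, full_die_cores):.1f} MHz")
```

Under these assumptions, enabling all 3840 cores at the Tesla P100's boost clock would land a little short of 12 TFLOPs, so a full fat card would also need a modest clock bump to reach that number.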

GPU Family	AMD Vega	AMD Navi	NVIDIA Pascal	NVIDIA Volta
Flagship GPU	Vega 10	Navi 10	NVIDIA GP100	NVIDIA GV100
GPU Process	14nm FinFET	7nm FinFET	TSMC 16nm FinFET	TSMC 12nm FinFET
GPU Transistors	15-18 Billion	TBC	15.3 Billion	21.1 Billion
GPU Cores (Max)	4096 SPs	TBC	3840 CUDA Cores	5376 CUDA Cores
Peak FP32 Compute	13.0 TFLOPs	TBC	12.0 TFLOPs	>15.0 TFLOPs (Full Die)
Peak FP16 Compute	25.0 TFLOPs	TBC	24.0 TFLOPs	120 Tensor TFLOPs
VRAM	16 GB HBM2	TBC	16 GB HBM2	16 GB HBM2
Memory (Consumer Cards)	HBM2	HBM3	GDDR5X	GDDR6
Memory (Dual-Chip Professional/ HPC)	HBM2	HBM3	HBM2	HBM2
HBM2 Bandwidth	484 GB/s (Frontier Edition)	>1 TB/s?	732 GB/s (Peak)	900 GB/s
Graphics Architecture	Next Compute Unit (Vega)	Next Compute Unit (Navi)	5th Gen Pascal CUDA	6th Gen Volta CUDA
Successor of (GPU)	Radeon RX 500 Series	Radeon RX 600 Series	GM200 (Maxwell)	GP100 (Pascal)
Launch	2017	2019	2016	2017

NVIDIA's NVLINK Is a Fast GPU Interconnect Fabric With Speeds of 160 GB/s - Backbone of NVIDIA Powered Supercomputers

The Pascal GP100 GPU is a server and workstation class chip, and since it is aimed at the HPC market first, it also introduces NVLINK, the next generation Unified Virtual Memory link with Gen 2.0 cache coherency features and 5 to 12 times the bandwidth of a regular PCIe connection. This will solve many of the bandwidth issues that high performance GPUs currently face.

NVLINK will allow several GPUs to be connected in parallel in HPC focused platforms that feature several nodes fitted with Pascal GPUs for compute oriented workloads. The NVLINK interconnect gives the processors inside HPC nodes a faster link than traditional PCI-e Gen3 lanes, with speeds of up to 160 GB/s. Pascal GPUs will also feature Unified Memory support, allowing the CPU and GPU to share the same memory pool, along with mixed precision support. NVLINK will be featured in PCs using ARM64 chips and some x86 powered HPC servers that utilize OpenPower, Tyan and Quantum solutions.
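
For a sense of scale, the sketch below compares the 160 GB/s NVLINK figure with a PCIe Gen3 x16 slot. The PCIe numbers (8 GT/s per lane with 128b/130b encoding) are not from this article and are our assumption for the comparison. The article also does not say whether 160 GB/s is a one-way or a combined figure, so both comparisons are shown.

```python
NVLINK_GB_S = 160.0  # figure quoted in the article


def pcie_gen3_gb_s(lanes: int = 16) -> float:
    """Per-direction PCIe Gen3 bandwidth in GB/s.

    Assumptions (not from the article): 8 GT/s per lane and
    128b/130b line encoding.
    """
    gigatransfers_per_s = 8.0
    encoding_efficiency = 128 / 130
    return lanes * gigatransfers_per_s * encoding_efficiency / 8


if __name__ == "__main__":
    one_way = pcie_gen3_gb_s(16)
    both_ways = 2 * one_way
    print(f"PCIe Gen3 x16, one direction: {one_way:.2f} GB/s")
    print(f"NVLINK vs one direction: {NVLINK_GB_S / one_way:.1f}x")
    print(f"NVLINK vs both directions: {NVLINK_GB_S / both_ways:.1f}x")
```

Both results sit inside the 5 to 12 times range quoted for NVLINK.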

The Pascal based Tesla GPU is the next incremental step in HPC acceleration. This is NVIDIA's fastest graphics card to date for the professional market and we can't wait for NVIDIA to release a consumer version of the GPU later this year. As stated before, the Pascal GPU will be shipping to cloud services first in 2016 followed by OEMs in Q1 2017.

NVIDIA Tesla Graphics Cards Comparison:

Tesla Graphics Card Name	NVIDIA Tesla M2090	NVIDIA Tesla K40	NVIDIA Tesla K80	NVIDIA Tesla P100	NVIDIA Tesla V100
GPU Architecture	Fermi	Kepler	Kepler	Pascal	Volta
GPU Process	40nm	28nm	28nm	16nm	12nm
GPU Name	GF110	GK110	GK210 x 2	GP100	GV100
Die Size	520mm2	561mm2	561mm2	610mm2	815mm2
Transistor Count	3.00 Billion	7.08 Billion	7.08 Billion	15.3 Billion	21.1 Billion
CUDA Cores	512 CCs (16 CUs)	2880 CCs (15 CUs)	2496 CCs (13 CUs) x 2	3584 CCs	5120 CCs
Core Clock	Up To 650 MHz	Up To 875 MHz	Up To 875 MHz	Up To 1480 MHz	Up To 1455 MHz
FP32 Compute	1.33 TFLOPs	4.29 TFLOPs	8.74 TFLOPs	10.6 TFLOPs	15.0 TFLOPs
FP64 Compute	0.66 TFLOPs	1.43 TFLOPs	2.91 TFLOPs	5.30 TFLOPs	7.50 TFLOPs
VRAM Size	6 GB	12 GB	12 GB x 2	16 GB	16 GB
VRAM Type	GDDR5	GDDR5	GDDR5	HBM2	HBM2
VRAM Bus	384-bit	384-bit	384-bit x 2	4096-bit	4096-bit
VRAM Speed	3.7 GHz	6 GHz	5 GHz	737 MHz	878 MHz
Memory Bandwidth	177.6 GB/s	288 GB/s	240 GB/s	720 GB/s	900 GB/s
Maximum TDP	250W	235W	300W	300W	300W