The Kepler K20[1] is built from SMX units, which are best compared to CPU cores. Each SMX has its own cache, instruction dispatching units and memory interface. A Kepler SMX (the K20X has 14 of them) holds 192 single-precision floating-point units, each of which can perform a multiply-add in a single cycle (the K20X is clocked at 732 MHz). As a result, the announced peak performance is 3.95 Tflops. It also holds 64 double-precision floating-point units with the same instruction throughput, for an announced 1.31 Tflops.
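These announced figures can be sanity-checked from the unit counts, counting a multiply-add as 2 flops (the per-SMX unit counts and clock are the ones quoted above):

```python
# Peak-performance check for the K20X, assuming 14 SMX, 192 SP / 64 DP
# units per SMX, one multiply-add (2 flops) per unit per cycle, and the
# 732 MHz clock quoted above.
SMX, SP_UNITS, DP_UNITS, CLOCK_GHZ = 14, 192, 64, 0.732

sp_peak = SMX * SP_UNITS * 2 * CLOCK_GHZ  # Gflops
dp_peak = SMX * DP_UNITS * 2 * CLOCK_GHZ  # Gflops
print(sp_peak, dp_peak)  # ~3935 and ~1312 Gflops
```

This lands within rounding distance of the announced 3.95 and 1.31 Tflops.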

Work distribution on a Kepler is organized in warps of 32 entries. Since every thread within a warp executes the same operation (with some threads potentially masked off), we can risk an analogy to CPU vector units (current AVX systems have 8 single-precision lanes). Each SMX has four warp schedulers with two dispatch units each, so up to two instructions can be issued per warp per cycle[2].

Each SMX can run several contexts at the same time. This context distribution is somewhat flexible, but performs best when the instructions are the same (note the single instruction cache per SMX). Up to 2048 “threads” can be resident at the same time, which amounts to 64 warps. Hiding the latency of some operations (such as memory accesses) requires maximizing the number of simultaneously active warps.

Note that 2 Mbit of registers are available on each SMX, for a rough total of 26 Mbit on a K20c. This large register file has to be shared amongst the active warps, narrowing it down to 1024 bits per thread, that is, 32 registers of 32 bits.
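The per-thread budget follows directly from the register-file size; a quick check, assuming 65536 32-bit registers per SMX (consistent with the 2 Mbit figure) and 13 SMX on a K20c:

```python
# Register-file arithmetic: 65536 32-bit registers per SMX is an
# assumption consistent with the 2 Mbit figure above.
REGS_PER_SMX, SMX_K20C, MAX_THREADS = 65536, 13, 2048

bits_per_smx = REGS_PER_SMX * 32            # 2 Mbit per SMX
total_mbit = SMX_K20C * bits_per_smx // 2**20  # ~26 Mbit on a K20c
per_thread_bits = bits_per_smx // MAX_THREADS  # 1024 bits per thread
per_thread_regs = per_thread_bits // 32        # 32 registers of 32 bits
print(total_mbit, per_thread_bits, per_thread_regs)
```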

Memory bound or compute bound

One of the metrics we analyse is the ratio between raw compute performance and memory bandwidth. This gives, as an asymptotic behaviour, the number of operations that can be performed per memory operation. It helps define the limit between memory-bound and compute-bound problems.

Chip   Bandwidth (GB/s)   SP peak (Gflops)   Ratio   DP peak (Gflops)   Ratio
K20C   208                3519               67.7    1173               45.1
K20X   250                3951               63.2    1317               42.1
K40    288                4291               59.6    1430               39.7
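The ratios count one memory operation per word of the operand size (4 bytes in single precision, 8 in double). A sketch of the computation for the K20C line:

```python
def flops_per_memop(gflops, bandwidth_gbs, word_bytes):
    # operations per memory word = flops / (words transferred per second)
    return gflops / (bandwidth_gbs / word_bytes)

print(round(flops_per_memop(3519, 208, 4), 1))  # 67.7 (K20C, single precision)
print(round(flops_per_memop(1173, 208, 8), 1))  # 45.1 (K20C, double precision)
```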

Bandwidth benchmark

We analyse the read bandwidth of the architecture with two tests, ECC on and ECC off, depending on how critical memory reliability is.

Chip   Peak (GB/s)   ECC (GB/s)   Ratio   No-ECC (GB/s)   Ratio
K20C   208           154.30       74.2%   184.99          88.9%
K20X   250           182.68       73.2%   220.12          88.2%
K40    288           197.65       68.6%   217.29          81.0%

Note on madd and GFLOPS

Not every algorithm can make full use of the madd operation. In this document, we rather consider madd as just another kind of floating-point operation. Most architectures execute madd in one cycle, or at least in the same cycle count as add or mul; we thus count it as a single flop. In this view, the raw compute power of the hardware is half of the marketing figures. Algorithms that reconstruct multiply-add instructions from the expression graph are widespread in compilers.
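With madd counted as a single flop, the marketing peaks are simply halved; for the K20C this yields the "Peak" columns used in the compute benchmark tables:

```python
# Counting a madd as one flop halves the marketing figures (K20C, Gflops).
sp_marketing, dp_marketing = 3519, 1173
print(round(sp_marketing / 2), round(dp_marketing / 2))  # 1760 586
```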

Compute benchmark

For this benchmark, we use a Taylor expansion of the expm1 function: the number of operations is known exactly, and no branching occurs.
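A minimal sketch of such a kernel (the function name and term count are our choices, not the benchmark's actual code): a truncated Taylor series for expm1 evaluated in Horner form, so each iteration is the same short, branch-free sequence of floating-point operations and the total operation count is known exactly.

```python
import math

def expm1_taylor(x, terms=12):
    # expm1(x) = x + x^2/2! + ... + x^terms/terms!, evaluated in Horner
    # form; no branching depends on the data, so the operation count is
    # fixed and known.
    acc = 1.0
    for n in range(terms, 1, -1):
        acc = 1.0 + acc * x / n
    return acc * x

print(abs(expm1_taylor(0.5) - math.expm1(0.5)))  # < 1e-13 for |x| <= 0.5
```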

Nvidia Kepler

On Kepler, there are 4 warp schedulers and 6 warp-wide execution units. Hence using more than 66.6% of the hardware requires instruction-level parallelism (ILP), i.e. dual-issuing independent instructions from the same warp. This feature is not exposed programmatically; we rather need to provide the compiler and driver with opportunities to use it.
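A typical way to hand the compiler such opportunities is to process several independent elements per thread, so that consecutive instructions carry no data dependency. A sketch of the transformation (in plain Python, purely to illustrate the dependency structure; the names are ours):

```python
def sum_single_chain(xs):
    # one accumulator: every add depends on the previous one (no ILP)
    acc = 0.0
    for x in xs:
        acc += x
    return acc

def sum_two_chains(xs):
    # two independent accumulators: adjacent adds have no dependency,
    # giving the scheduler a second instruction it may dual-issue
    a = b = 0.0
    for i in range(0, len(xs) - 1, 2):
        a += xs[i]
        b += xs[i + 1]
    if len(xs) % 2:
        a += xs[-1]
    return a + b

xs = [0.5] * 1000
print(sum_single_chain(xs), sum_two_chains(xs))  # 500.0 500.0
```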

Chip   SP peak (Gflops)   SP achieved   Ratio   DP peak (Gflops)   DP achieved   Ratio
K20C   1760               1418          80.6%   586                540           92.2%
K20X   1968               1599          81.3%   656                591           90.1%
K40    2146               1608          74.9%   715                632           88.4%

Memory-Compute limit revisited

We finally revisit the first metric using the achieved figures.

Chip   ECC bandwidth (GB/s)   SP (Gflops)   Ratio   DP (Gflops)   Ratio
K20C   154.30                 1418          37      540           28
K20X   182.68                 1599          35      591           26
K40    197.65                 1608          33      632           26

[2] “Furthermore, some degree of ILP in conjunction with TLP is required by Kepler GPUs in order to approach peak performance, since SMX’s warp scheduler issues one or two independent instructions from each of four warps per clock.” – from the Kepler Tuning Guide