The Kepler K20[1] is built from SMX units (streaming multiprocessors), which are the closest analogue to CPU cores. Each SMX has its own cache, instruction dispatch units and memory interface. A Kepler SMX (there are 14 on the K20X) holds 192 single-precision floating-point units, each of which can perform a multiply-add per cycle (the K20X is clocked at 732 MHz). The resulting announced peak performance is 3.95 Tflops. Each SMX also holds 64 double-precision floating-point units with the same instruction throughput, for an announced 1.31 Tflops.
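The peak figures above follow from a simple product, sketched here in Python. The function name and the `flops_per_unit_per_cycle` parameter are illustrative choices, not part of any vendor tool; a multiply-add is counted as two flops, as in the announced figures.

```python
def peak_gflops(smx_count, units_per_smx, clock_mhz, flops_per_unit_per_cycle=2):
    """Theoretical peak in GFLOPS; a multiply-add counts as 2 flops by default."""
    flops_per_cycle = smx_count * units_per_smx * flops_per_unit_per_cycle
    # MHz * flops/cycle gives Mflops; divide by 1000 to get GFLOPS.
    return flops_per_cycle * clock_mhz / 1000.0


# K20X figures quoted in the text: 14 SMX at 732 MHz.
k20x_sp = peak_gflops(14, 192, 732)  # single precision
k20x_dp = peak_gflops(14, 64, 732)   # double precision
print(f"SP: {k20x_sp:.1f} GFLOPS, DP: {k20x_dp:.1f} GFLOPS")
```

With these inputs the product lands within about half a percent of the announced 3.95 and 1.31 Tflops; the small gap is presumably rounding of the clock frequency.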

Work on a Kepler is distributed in warps of 32 threads. Since every thread within a warp executes the same operation, with some threads potentially skipped, we can risk an analogy with CPU vector units (current AVX units hold 8 single-precision entries). Each SMX has four warp schedulers with two dispatch units each. Up to two independent instructions from the same warp can be dispatched per cycle[2].

Each SMX can run several contexts at the same time. This context distribution is somewhat flexible, but works best when the instructions are the same (note the single instruction cache per SMX). Up to 2048 “threads” can be resident at the same time, which amounts to 64 warps. Hiding the latency of some operations (such as memory accesses) requires maximizing the number of warps active at the same time.

Note that the register file holds 2 Mbits per SMX, for a rough total of 26 Mbits on a K20c. This large register file has to be shared among the active warps; with all 2048 threads resident, this leaves 1024 bits per thread, that is, 32 registers of 32 bits.
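The register budget can be worked through as follows. This is a minimal sketch using only the figures quoted above (2 Mbits per SMX, 2048 threads, warps of 32); the function names are illustrative, and the bound deliberately ignores register allocation granularity and other occupancy limits such as shared memory.

```python
WARP_SIZE = 32
MAX_THREADS_PER_SMX = 2048
REGISTER_FILE_BITS = 2 * 1024 * 1024  # 2 Mbits per SMX
REGISTER_BITS = 32

TOTAL_REGISTERS = REGISTER_FILE_BITS // REGISTER_BITS


def registers_per_thread_at_full_occupancy():
    """Registers each thread gets when all 2048 threads are resident."""
    return TOTAL_REGISTERS // MAX_THREADS_PER_SMX


def max_resident_warps(registers_per_thread):
    """Upper bound on resident warps per SMX for a given register usage.

    Simplified: only the register file and the thread limit are considered.
    """
    by_registers = TOTAL_REGISTERS // (registers_per_thread * WARP_SIZE)
    by_threads = MAX_THREADS_PER_SMX // WARP_SIZE
    return min(by_threads, by_registers)


print(registers_per_thread_at_full_occupancy())
for regs in (16, 32, 64, 128):
    print(regs, "registers per thread ->", max_resident_warps(regs), "warps")
```

A kernel using more than 32 registers per thread therefore cannot keep all 64 warps resident, which reduces the latency hiding discussed above.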

### Memory bound or compute bound

One of the metrics we analyse is the ratio between raw compute performance and memory bandwidth. This provides, as an asymptotic behaviour, the number of operations that can be performed per memory operation. It helps define the limit between memory-bound and compute-bound problems.

Chip | Bandwidth (GB/s) | Single Precision (GFLOPS) | Ratio | Double Precision (GFLOPS) | Ratio |
---|---|---|---|---|---|
K20C | 208 | 3519 | 67.7 | 1173 | 45.1 |
K20X | 250 | 3951 | 63.2 | 1317 | 42.1 |
K40 | 288 | 4291 | 59.6 | 1430 | 39.7 |
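The ratios in this table can be reproduced with the short sketch below. It assumes the ratio is expressed in flops per word loaded (4 bytes in single precision, 8 bytes in double precision); that unit is inferred from the numbers rather than stated in the text, and the function name is illustrative.

```python
def flops_per_word(peak_gflops, bandwidth_gb_s, bytes_per_word):
    """Operations available per word loaded from memory (assumed unit)."""
    return peak_gflops / bandwidth_gb_s * bytes_per_word


# (bandwidth GB/s, SP peak GFLOPS, DP peak GFLOPS), copied from the table.
chips = {
    "K20C": (208, 3519, 1173),
    "K20X": (250, 3951, 1317),
    "K40": (288, 4291, 1430),
}

for name, (bandwidth, sp_peak, dp_peak) in chips.items():
    sp_ratio = flops_per_word(sp_peak, bandwidth, 4)
    dp_ratio = flops_per_word(dp_peak, bandwidth, 8)
    print(f"{name}: SP {sp_ratio:.1f}, DP {dp_ratio:.1f}")
```

A kernel performing fewer operations per loaded word than this ratio cannot saturate the arithmetic units, however well it is written.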

### Bandwidth benchmark

We analyse the read bandwidth of the architecture with two tests, ECC enabled and ECC disabled; which one is relevant depends on how critical memory reliability is.

Chip | Peak (GB/s) | ECC (GB/s) | Ratio | No-ECC (GB/s) | Ratio |
---|---|---|---|---|---|
K20C | 208 | 154.30 | 74.2% | 184.99 | 88.9% |
K20X | 250 | 182.68 | 73.2% | 220.12 | 88.2% |
K40 | 288 | 197.65 | 68.6% | 217.29 | 81.0% |

### Note on madd and GFLOPS

Not every algorithm can make full use of the madd operation. In this document, we therefore treat madd as just another kind of floating-point operation. Most architectures have a one-cycle madd, or at least one with the same cycle count as add or mul; we thus count it as a single flop. In that respect, the raw compute power of the hardware is halved compared to marketing figures. Algorithms that reconstruct multiply-add instructions from the evaluation graph are widespread in compilers.

### Compute benchmark

For this benchmark, we use a Taylor expansion of the expm1 function. We know the number of operations, and no branching occurs.
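To make the operation count concrete, here is a minimal Python sketch of a branch-free Taylor expansion of expm1 evaluated with Horner's scheme. It is an illustration only: the actual benchmark kernel, its expansion order and its evaluation scheme are not given in the text, so `order=12` and the function names are assumptions.

```python
import math


def expm1_taylor(x, order=12):
    """Truncated Taylor series x + x^2/2! + ... + x^order/order!.

    Horner evaluation: each loop iteration is one multiply-add, and the
    final line is one multiply. There is no data-dependent branching.
    """
    coeffs = [1.0 / math.factorial(k) for k in range(1, order + 1)]
    acc = coeffs[-1]
    for c in reversed(coeffs[:-1]):
        acc = acc * x + c  # one madd
    return acc * x  # one mul


def flops_per_evaluation(order=12):
    """Operation count with a madd counted as a single flop."""
    return (order - 1) + 1  # (order - 1) madds plus the final mul


x = 0.5
print(expm1_taylor(x), math.expm1(x), flops_per_evaluation())
```

Because the operation count per evaluation is fixed, the achieved flop rate is simply the number of evaluations times that count, divided by the elapsed time.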

On Kepler, each SMX has 4 warp schedulers and 6 warp-wide instruction units (192 floating-point units for warps of 32 threads). Hence using more than 66.6% of the hardware requires Instruction Level Parallelism (ILP). This feature cannot be controlled programmatically; instead, we need to give the compiler and driver opportunities to use it.
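The 66.6% limit is a plain issue-rate argument, sketched below. This is a simplified model that only counts instructions issued against warp-wide units; the function name and the idea of an average issue rate per scheduler are illustrative assumptions, not a description of the real scheduler.

```python
def issue_limited_utilisation(instr_per_scheduler_per_cycle,
                              schedulers=4, warp_units=6):
    """Fraction of warp-wide units that can be kept busy each cycle."""
    issued = schedulers * instr_per_scheduler_per_cycle
    return min(1.0, issued / warp_units)


# Without ILP each scheduler issues one instruction per cycle.
print(f"no ILP:     {issue_limited_utilisation(1.0):.1%}")
# Averaging 1.5 independent instructions per scheduler fills all units.
print(f"ILP of 1.5: {issue_limited_utilisation(1.5):.1%}")
```

In practice this means writing kernels with several independent dependency chains per thread, so that the second dispatch unit of each scheduler has something to issue.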

Chip | Peak SP (GFLOPS) | Single Precision (GFLOPS) | Ratio | Peak DP (GFLOPS) | Double Precision (GFLOPS) | Ratio |
---|---|---|---|---|---|---|
K20C | 1760 | 1418 | 80.6% | 586 | 540 | 92.2% |
K20X | 1968 | 1599 | 81.3% | 656 | 591 | 90.1% |
K40 | 2146 | 1608 | 74.9% | 715 | 632 | 88.4% |

### Memory-Compute limit revisited

Finally, we revisit the first metric using the achieved performance figures.

Chip | Bandwidth (GB/s) | Single Precision (GFLOPS) | Ratio | Double Precision (GFLOPS) | Ratio |
---|---|---|---|---|---|
K20C | 154.30 | 1418 | 37 | 540 | 28 |
K20X | 182.68 | 1599 | 35 | 591 | 26 |
K40 | 197.65 | 1608 | 33 | 632 | 26 |
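These achieved ratios can be used to classify a kernel as memory bound or compute bound. The sketch below is a simple roofline-style estimate under stated assumptions: the kernel's intensity is given in flops per word loaded (with madd counted as one flop, as elsewhere in this document), words are 4 bytes by default, and the function names are illustrative.

```python
def attainable_gflops(flops_per_word, compute_gflops, bandwidth_gb_s,
                      bytes_per_word=4):
    """Upper bound on throughput for a kernel of given arithmetic intensity."""
    gwords_per_second = bandwidth_gb_s / bytes_per_word
    memory_limit = flops_per_word * gwords_per_second
    return min(compute_gflops, memory_limit)


def is_memory_bound(flops_per_word, compute_gflops, bandwidth_gb_s,
                    bytes_per_word=4):
    """True when the memory limit is below the compute limit."""
    memory_limit = flops_per_word * bandwidth_gb_s / bytes_per_word
    return memory_limit < compute_gflops


# Achieved K20C single-precision figures from the table above.
K20C_GFLOPS, K20C_GB_S = 1418, 154.30

for intensity in (10, 37, 100):
    bound = "memory" if is_memory_bound(intensity, K20C_GFLOPS, K20C_GB_S) else "compute"
    limit = attainable_gflops(intensity, K20C_GFLOPS, K20C_GB_S)
    print(f"{intensity} flops/word: {limit:.1f} GFLOPS at most ({bound} bound)")
```

The crossover sits at the ratio listed in the table: below roughly 37 flops per float on a K20C, bandwidth is the limiting factor.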