Intel Xeon Phi

The Intel Xeon Phi is an implementation of the MIC (Many Integrated Core) architecture.

It holds several independent cores (61 in our setup), each with a 512-bit vector unit[1]. Each core is hyper-threaded, running up to four threads. Vector operations are very similar to SSE or AVX, but the instruction set is more complete. Moreover, the new gather and scatter operations ease vector access to memory by performing a full lookup in a single instruction.

Memory bound or compute bound

One of the metrics we analyse is the ratio between raw compute performance and memory bandwidth. As an asymptotic figure, it gives the number of floating-point operations that can be performed per memory access, and helps draw the line between memory-bound and compute-bound problems.

Chip    Bandwidth (GB/s)    SP peak (GFLOPS)    ratio    DP peak (GFLOPS)    ratio
SE10P   352                 2130                24.2     1065                24.2

Bandwidth benchmark

We analyse the read bandwidth of the Intel Xeon Phi with two tests, ECC and no-ECC, depending on how critical memory reliability is for the application.

Chip    Peak (GB/s)    ECC (GB/s)    ratio    No-ECC (GB/s)    ratio
SE10P   352            162.08        46.0%    168.04           47.9%

Note on madd and GFLOPS

Not every algorithm can make full use of the madd (multiply-add) operation. In this document, we instead treat madd as just another kind of floating-point operation. Most architectures execute madd in one cycle, or at least in the same cycle count as add or mul, so we count it as a single flop. Under this convention, the raw compute power of the hardware is half the marketing figure. Algorithms that reconstruct multiply-add instructions from the expression evaluation graph are widespread in compilers.

Compute benchmark

For this benchmark, we use a Taylor expansion of the expm1 function: the number of operations is known in advance, and no branching occurs.

Chip    SP peak (GFLOPS)    SP measured    ratio    DP peak (GFLOPS)    DP measured    ratio
SE10P   1065                879            82.5%    533                 440            82.5%

Memory-Compute limit revisited

We finally revisit the first metric, this time using the measured performance figures.

Chip    Bandwidth (GB/s)    SP measured (GFLOPS)    ratio    DP measured (GFLOPS)    ratio
SE10P   168.02              879                     22       440                     22