Altimesh benchmarks -- expm1 -- floating point performance

This benchmark measures the floating-point performance of an architecture. We use a Taylor expansion of the expm1 function (the exponential of x, minus one), truncated at a fixed degree and evaluated in Horner form without any branching. The listing below applies 13 multiply-add steps, which covers the terms up to x^15. The number of floating-point operations, whether additions, multiplications or fused multiply-adds (each counted as a single flop in our case), is therefore known in advance.

We evaluate the function twelve times per element so that arithmetic dominates memory traffic. This removes any doubt about whether the run is compute bound or memory bound: it is compute bound.

// Horner form of x + x^2/2! + ... + x^15/15!, with every coefficient scaled by 15!.
double expm1(double x)
{
    return ((((((((((((((15.0 + x)
        * x + 210.0)
        * x + 2730.0)
        * x + 32760.0)
        * x + 360360.0)
        * x + 3603600.0)
        * x + 32432400.0)
        * x + 259459200.0)
        * x + 1816214400.0)
        * x + 10897286400.0)
        * x + 54486432000.0)
        * x + 217945728000.0)
        * x + 653837184000.0)
        * x + 1307674368000.0)
        * x * 7.6471637318198164759011319857881e-13; // last factor is 1/15!
}
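
As a sanity check on the listing, the polynomial can be compared against the standard library's expm1. The sketch below is ours, not part of the benchmark: the names expm1_poly and relative_error are hypothetical, and the function is renamed only so that it does not clash with std::expm1. The comments also record our own count of the operations per evaluation.

```cpp
#include <cmath>

// Same Horner-form polynomial as the listing above, under a hypothetical
// name chosen to avoid clashing with the standard library's expm1.
double expm1_poly(double x)
{
    // Coefficients 15!/k! for k = 13 down to 1 (15!/14! = 15 starts the chain).
    static const double coeffs[13] = {
        210.0, 2730.0, 32760.0, 360360.0, 3603600.0, 32432400.0,
        259459200.0, 1816214400.0, 10897286400.0, 54486432000.0,
        217945728000.0, 653837184000.0, 1307674368000.0};

    double acc = 15.0 + x;          // 1 addition
    for (double c : coeffs)         // 13 multiply-add steps
        acc = acc * x + c;
    // 2 multiplications; the constant is 1/15!.
    return acc * x * 7.6471637318198164759011319857881e-13;
}

// Relative error against the library implementation (x must be non-zero).
double relative_error(double x)
{
    const double reference = std::expm1(x);
    return std::fabs(expm1_poly(x) - reference) / std::fabs(reference);
}
```

For inputs of moderate magnitude, such as values between -1 and 1, the truncated series should agree closely with the library result. The error grows for larger |x|, where the dropped terms are no longer negligible.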

The twelve evaluations are applied to each element of a large array in order to exploit parallelism. Depending on the architecture and the flavor of the transformation, the code may also be vectorized.
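
To make the setup concrete, here is a minimal single-threaded C++ sketch of such a measurement loop. It is an illustration under assumptions, not the benchmark's actual code: we assume the twelve evaluations are chained (each result feeds the next), we count 16 operations per evaluation from the listing (1 addition, 13 multiply-adds, 2 multiplications, with a fused multiply-add counted as one), and all names (expm1_poly, apply_repeated, total_flops, measure_gflops) are hypothetical.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Polynomial from the listing above, under a hypothetical name.
double expm1_poly(double x)
{
    static const double coeffs[13] = {
        210.0, 2730.0, 32760.0, 360360.0, 3603600.0, 32432400.0,
        259459200.0, 1816214400.0, 10897286400.0, 54486432000.0,
        217945728000.0, 653837184000.0, 1307674368000.0};
    double acc = 15.0 + x;
    for (double c : coeffs)
        acc = acc * x + c;
    return acc * x * 7.6471637318198164759011319857881e-13;
}

// Assumed count: 1 add + 13 multiply-adds + 2 multiplies per evaluation.
constexpr int kFlopsPerEval = 16;
constexpr int kRepeats = 12;

// Applies the polynomial kRepeats times to every element. Chaining the
// evaluations is an assumption of this sketch.
void apply_repeated(std::vector<double>& data)
{
    for (std::size_t i = 0; i < data.size(); ++i) {
        double v = data[i];
        for (int r = 0; r < kRepeats; ++r)
            v = expm1_poly(v);
        data[i] = v;
    }
}

// Operation count implied by the assumptions above.
double total_flops(std::size_t n)
{
    return static_cast<double>(n) * kRepeats * kFlopsPerEval;
}

// Times one pass over the array and converts the result to GFLOP/s.
double measure_gflops(std::vector<double>& data)
{
    const auto start = std::chrono::steady_clock::now();
    apply_repeated(data);
    const auto stop = std::chrono::steady_clock::now();
    const double seconds = std::chrono::duration<double>(stop - start).count();
    return total_flops(data.size()) / seconds * 1e-9;
}
```

Each element is read once and written once for 192 operations under these assumptions, which is what keeps the loop far from being memory bound. The outer loop has no dependency between elements, so it is the natural place for threads or vector lanes.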
We wrote the code in C# and generated the target code using Hybridizer. For comparison, we also tried to write the best native code for each target architecture. For Intel and IBM targets, that code is written in C++ using intrinsics rather than raw assembly. For CUDA targets, it is written in plain CUDA. We measure the performance of both versions to see how Hybridizer performs relative to handwritten code.

Architecture	Generated	Handwritten	Ratio
NVIDIA – P100	1954.7	2360.6	82.8%
NVIDIA – K20C	505	551.1	91.6%
INTEL – XEON PHI – 7210	912.5	1003.3	90.9%
INTEL – Xeon E5 1620 v3 – 3.5 GHz	60.8	84.9	71.6%
INTEL – Core i7 6700 – 3.4 GHz	88.1	104.5	84.3%
IBM – POWER8 (2S)	188.2	220.1	85.5%

Note that on Haswell (Xeon E5 1620 v3), the measured performance is quite far from peak. This is due to the fused multiply-add instruction, which has a latency of 5 cycles for a reciprocal throughput of 0.5 cycle (two instructions issued per cycle). Keeping the pipeline saturated therefore requires 5 / 0.5 = 10 independent instructions in flight. None of the back-end compilers we tried was able to interleave enough instructions correctly. To reach peak on those architectures, it seems necessary to write raw assembly. This will be detailed further in a dedicated blog post.

Tags: Benchmark, Flops