Stream benchmark

The stream benchmark measures the memory bandwidth of the architecture. We simply add two vectors, in what is effectively our “Hello World” program. A simplified version of the C# code can be found in our SDK on GitHub.

As with the Expm1 benchmark, we run the generated code against the best handwritten native code we could write. This lets us compare the performance of hybridized code with that of handwritten native code using intrinsics.

[EntryPoint("run")]
public static void Run(int N, double[] a, double[] b)
{
    int start = threadIdx.x + blockDim.x * blockIdx.x;
    int step = blockDim.x * gridDim.x;
    for(int k = start; k < N; k += step) 
    {
        a[k] += b[k];
    }
}
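The loop above is a grid-stride loop: each thread starts at its global index and advances by the total number of threads, so every element is processed exactly once whatever the launch configuration. The sketch below is a plain C# simulation of that index arithmetic on the CPU. It does not use Hybridizer or a GPU, and `GridStrideDemo`, its explicit `gridDim`/`blockDim` parameters, and `Example` are hypothetical names introduced only for illustration.

```csharp
using System;

// CPU-only simulation of the grid-stride loop; no GPU or Hybridizer required.
public static class GridStrideDemo
{
    // Runs the "a[k] += b[k]" body once per simulated (block, thread) pair,
    // sequentially, using the same start/step arithmetic as the kernel.
    public static void Run(int n, double[] a, double[] b, int gridDim, int blockDim)
    {
        for (int blockIdx = 0; blockIdx < gridDim; ++blockIdx)
        {
            for (int threadIdx = 0; threadIdx < blockDim; ++threadIdx)
            {
                int start = threadIdx + blockDim * blockIdx; // global thread index
                int step = blockDim * gridDim;               // total thread count
                for (int k = start; k < n; k += step)
                {
                    a[k] += b[k];
                }
            }
        }
    }

    // 10 elements handled by 2 blocks of 3 threads (6 simulated threads).
    public static double[] Example()
    {
        var a = new double[10];
        var b = new double[10];
        for (int i = 0; i < 10; ++i) { a[i] = i; b[i] = 1.0; }
        Run(10, a, b, gridDim: 2, blockDim: 3);
        return a; // a[i] == i + 1: every index was updated exactly once
    }
}
```

With 6 simulated threads and 10 elements, threads 0 to 3 each handle two indices (for instance thread 0 handles 0 and 6) and threads 4 and 5 handle one each.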

No Hint

In a first version, we don’t give any hint about alignment or index coalescing. This has little to no effect on CUDA machines, and the hardware prefetcher appears to do a very good job on Intel Xeon machines. On Xeon Phi, however, it leads to a gather instruction, which carries a small performance penalty. The next section shows how to restore performance.
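To make the contiguous-versus-gather distinction concrete, here is a minimal sketch using `System.Numerics.Vector<double>` on the CPU. It is not the code Hybridizer generates. `GatherDemo` and its methods are hypothetical, and the gather is emulated lane by lane rather than issued as a hardware gather instruction. The point is only that when indices are not known to be consecutive, each lane has to be fetched separately.

```csharp
using System;
using System.Numerics;

public static class GatherDemo
{
    // Consecutive indices: one vector load per operand, straight from the arrays.
    public static void AddContiguous(double[] a, double[] b)
    {
        int w = Vector<double>.Count;
        int i = 0;
        for (; i <= a.Length - w; i += w)
        {
            var va = new Vector<double>(a, i);
            var vb = new Vector<double>(b, i);
            (va + vb).CopyTo(a, i);
        }
        for (; i < a.Length; ++i) a[i] += b[i]; // scalar tail
    }

    // Arbitrary indices (assumed distinct): each lane is read and written
    // individually, which is what a gather/scatter has to do.
    public static void AddGathered(double[] a, double[] b, int[] idx)
    {
        int w = Vector<double>.Count;
        var la = new double[w];
        var lb = new double[w];
        int i = 0;
        for (; i <= idx.Length - w; i += w)
        {
            for (int l = 0; l < w; ++l) { la[l] = a[idx[i + l]]; lb[l] = b[idx[i + l]]; }
            var sum = new Vector<double>(la) + new Vector<double>(lb);
            for (int l = 0; l < w; ++l) a[idx[i + l]] = sum[l];
        }
        for (; i < idx.Length; ++i) a[idx[i]] += b[idx[i]]; // scalar tail
    }
}
```

Both methods compute the same result when `idx` is a permutation of the valid indices. The second simply cannot use a single wide load because nothing guarantees that neighbouring lanes touch neighbouring addresses.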

| Architecture | Generated | Handwritten | Ratio |
|---|---|---|---|
| NVIDIA – P100 | 479.9 | 495.4 | 96.8% |
| NVIDIA – K20C | 167 | 185 | 90.2% |
| INTEL – XEON PHI – 7210 | 327.7 | 381.7 | 85.8% |
| INTEL – Xeon E5 1620 v3 – 3.5 GHz | 34.4 | 35 | 98.3% |
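For readers who want to derive comparable figures from their own timings, the sketch below shows one way to do it. It rests on two assumptions the text does not state: that bandwidth is reported in GB/s (1 GB = 10^9 bytes), and that each element of `a[k] += b[k]` counts as three 8-byte transfers (read `a`, read `b`, write `a`). `StreamMetrics` and its methods are hypothetical helpers, not part of the SDK.

```csharp
using System;

public static class StreamMetrics
{
    // Bytes moved by "a[k] += b[k]" over n doubles: read a, read b, write a.
    // Assumption: 3 transfers of sizeof(double) per element.
    public static long BytesMoved(long n) => 3L * sizeof(double) * n;

    // Effective bandwidth in GB/s (assuming 1 GB = 1e9 bytes) for a run of `seconds`.
    public static double BandwidthGBs(long n, double seconds) => BytesMoved(n) / seconds / 1e9;

    // Generated-to-handwritten ratio, as a percentage.
    public static double RatioPercent(double generated, double handwritten) =>
        100.0 * generated / handwritten;
}
```

As a made-up illustration, 100,000,000 elements processed in 5 ms gives `BandwidthGBs(100_000_000, 0.005)` of about 480 GB/s. Recomputing a ratio from the rounded figures in the table may differ from the published percentage in the last digit.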

Coalesced – Aligned

In a second version, we specify that our backing storage is an aligned memory location, and that the index will stay coalesced and aligned:

[EntryPoint("StreamDouble")]
public static void StreamDouble(alignedstorage_double a, alignedstorage_double b, int offset, int count)
{
    alignedindex start = offset + threadIdx.x + blockIdx.x * blockDim.x;
    alignedindex end = count + offset;
    for (alignedindex i = start; i < end; i += blockDim.x * gridDim.x)
    {
        a[i] = a[i] + b[i];
    }
}

| Architecture | Generated | Handwritten | Ratio |
|---|---|---|---|
| INTEL – XEON PHI – 7210 | 373.8 | 381.7 | 97.9% |
