1. Hybridizer HOWTO – My First Project

Hybridizer is a compiler that lets you run a single version of your C# or Java code on any hardware.
In this tutorial, we explain how to create a first C# project targeting the GPU. We use Hybridizer Essentials for the illustrations.

Warning/Disclaimer

We don’t support the entire C# language or .NET Framework. The main known limitations are:

  • Heap allocation (new A()) from a device thread is not supported (except for arrays)
  • System.Collections is not supported
  • The string type is not supported

Prerequisites

Software

You first need to install the following software:

  • Visual Studio 2012, 2013, 2015 or 2017. Warning: with Visual Studio 2017 and CUDA 9.2 or earlier, you need to install the v140 toolset from the Visual Studio Installer.
  • Ensure your Visual Studio installation supports C++ and not just C#.
  • CUDA toolkit 8.0, 9.0, 9.1, 9.2 or 10.0
  • Any version of the Hybridizer, including the free version, Hybridizer Essentials

License

You need to request a Hybridizer Subscription.
Subscriptions are our new licensing model for Hybridizer Essentials. A subscription can migrate from one machine to another, with only one machine authorized at a time.
Trials are unique and attached to your email address, while you can purchase as many commercial subscriptions as you want.
If you have not already purchased a subscription, you can request a trial for Hybridizer Essentials. To do that, click on Hybridizer->License Settings in Visual Studio:

hybridizer configuration

If you opted for the trial, provide your email address and click Subscribe:

hybridizer license settings

You should receive your license in your mailbox soon. If not, please contact us or create an issue on GitHub.

Open your mailbox, and select the license text as follows:

hybridizer license mail

Paste this text in the license textbox in Hybridizer Configuration, and click Refresh License.

Hybridizer should validate the subscription, assign a license to your machine, and tell you the following:

hybridizer valid license

First project

You have two options:

Brand new project

From Visual Studio, click File, New, Project. Choose C#, Altimesh:

project template

Build the C# project, then the generated native project, and run.

From existing C# project

First, create or open an existing C# console application.

Right click on the project in the solution explorer, and select “Hybridize Project”.

This step creates a native CUDA project and adds it to your solution. That project handles the files generated by Hybridizer from your managed C# project.

Fill the requested fields and click “Generate”:

create-satellite-project

If everything worked correctly, several things happened in the background:

  • Your C# project now references Hybridizer.Runtime.CUDAImports. This assembly provides all the necessary attributes to hybridize methods, a CUDA wrapper, and a memory marshaller. We will come back to those in later posts.
  • A native project has been created and added to your solution. This project references two files, hybridizer.generated.cpp and hybridizer.wrappers.cu. The first one will contain a cubin module. The second will export native symbols.
  • If it did not already exist, an x64 platform configuration has been added to your solution.

Before building anything, change the configuration to x64: NVIDIA is deprecating 32-bit support, and nvrtc requires 64-bit.

Create a kernel

In your main class, add the following code:

        [EntryPoint]
        public static void Hello()
        {
            Console.Out.Write("Hello from GPU");
        }

The EntryPoint attribute tells the hybridizer to generate a CUDA kernel, as if you wrote:

__global__ void Hello() {
    printf("Hello from GPU\n");
}

You can now build the C# project, and then the satellite project. You can inspect the generated files to see what Hybridizer produced:

  • hybridizer.generated.cpp contains a big array of bytes, which is the device code of your kernel.
  • hybridizer.wrappers.cu exports a symbol:
    extern "C" DLL_PUBLIC int ConsoleApplication3x46Programx46Hello_ExternCWrapper_CUDA(...)
    {
        CUresult cures ;
        if (__hybridizer__gs_module.module_data == 0)
        {
            cures = cuModuleLoadData (&(__hybridizer__gs_module.module), __hybridizer_cubin_module_data) ;
            if (cures != CUDA_SUCCESS) return (int)cures ;
        }

        CUfunction __hybridizer__cufunc ;

        cures = cuModuleGetFunction (&__hybridizer__cufunc, __hybridizer__gs_module.module, "ConsoleApplication3x46Programx46Hello") ;
        if (cures != CUDA_SUCCESS) return (int)cures ;
        // more generated code ...
        cures = cuLaunchKernel (__hybridizer__cufunc, ...) ;
        if (cures != CUDA_SUCCESS) return (int)cures ;
        int cudaLaunchRes = (int)::cudaPeekAtLastError ();
        if (cudaLaunchRes != 0) return cudaLaunchRes;
        int __synchronizeRes = (int)::cudaDeviceSynchronize () ;
        return __synchronizeRes ;
    }

Run it

In your main method, add the following boilerplate code:


        static void Main(string[] args)
        {
            cuda.DeviceSynchronize();
            HybRunner runner = HybRunner.Cuda("ConsoleApplication3_CUDA.vs2015.dll").SetDistrib(1, 2);
            runner.Wrap(new Program()).Hello();
        }

with the appropriate generated dll name. This code:

  • registers the generated dll as a CUDA dll: HybRunner.Cuda("ConsoleApplication3_CUDA.vs2015.dll")
  • configures kernel calls to run with 1 block of 2 threads: SetDistrib(1, 2)
  • registers the current object as a kernel container: runner.Wrap(new Program())
  • runs the generated method.

Then run:

hello from gpu

Congratulations! You just successfully ran your first C# kernel on the GPU!

2. Hybridizer HOWTO – Hello World

Our hello world is the element-wise addition of two vectors. The C# code can be downloaded from our GitHub.

Hello World : simple work distribution

We start with a simple way to express parallelism: the Parallel.For construct, which is provided natively by .NET. We place the EntryPoint attribute on the method to trigger hybridization:

[EntryPoint]
public static void VectorAdd(double[] a, double[] b, int N)
{
    Parallel.For(0, N, (i) => { a[i] += b[i]; });
}

As usual, we need to invoke this method with some boilerplate code.
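A minimal sketch of that boilerplate, following the same pattern as in the first project. The dll name, the array size and the SetDistrib(32, 128) values are placeholders chosen for illustration, and the sketch assumes that the memory marshaller copies the arrays to the device and back around the call:

```csharp
using System;
using System.Threading.Tasks;
using Hybridizer.Runtime.CUDAImports;

class Program
{
    [EntryPoint]
    public static void VectorAdd(double[] a, double[] b, int N)
    {
        Parallel.For(0, N, (i) => { a[i] += b[i]; });
    }

    static void Main(string[] args)
    {
        // Placeholder problem size and input data.
        const int N = 1024 * 1024;
        double[] a = new double[N];
        double[] b = new double[N];
        for (int i = 0; i < N; ++i)
        {
            a[i] = i;
            b[i] = 1.0;
        }

        // Assumption: replace with the name of the dll built by your satellite project.
        HybRunner runner = HybRunner.Cuda("HelloWorld_CUDA.dll").SetDistrib(32, 128);

        // Wrap the kernel container and call the hybridized method through the wrapper.
        dynamic wrapped = runner.Wrap(new Program());
        wrapped.VectorAdd(a, b, N);
        cuda.DeviceSynchronize();

        // Inspect one element to check that the addition took place.
        Console.WriteLine(a[10]);
    }
}
```

Calling Program.VectorAdd(a, b, N) directly instead of going through the wrapper runs the same method as plain C# on the CPU, which is a convenient way to cross-check results.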

Hello World : explicit work distribution

We can also use explicit work distribution, which is done using a CUDA-like syntax: threadIdx/blockDim, blockIdx/gridDim. This is customizable and names can be changed, but the concept is similar:


[EntryPoint]
public static void VectorAdd(double[] a, double[] b, int N)
{
    for (int k = threadIdx.x + blockDim.x * blockIdx.x;
        k < N; k += blockDim.x * gridDim.x)
    {
        a[k] += b[k];
    }
}

Explicit work distribution can be used (for example) to distribute work among a 2D-grid.
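To see why this loop covers the whole array, it can help to replay it on the CPU. The sketch below is plain C# without any Hybridizer dependency; the class and method names are hypothetical. It loops over every simulated block and thread, runs the same strided loop for each of them, and counts how many times each index is visited:

```csharp
using System;

static class GridStrideDemo
{
    // Counts how many times each index in [0, n) is visited when every
    // simulated thread runs the strided loop used in the kernel.
    public static int[] CountVisits(int n, int gridDim, int blockDim)
    {
        int[] visits = new int[n];
        for (int blockIdx = 0; blockIdx < gridDim; ++blockIdx)
        {
            for (int threadIdx = 0; threadIdx < blockDim; ++threadIdx)
            {
                // Same start index and stride as in the kernel.
                for (int k = threadIdx + blockDim * blockIdx; k < n; k += blockDim * gridDim)
                {
                    visits[k]++;
                }
            }
        }
        return visits;
    }

    public static void Main()
    {
        // 1000 elements on a grid of 4 blocks of 32 threads: fewer threads than elements.
        int[] visits = CountVisits(1000, 4, 32);
        bool once = Array.TrueForAll(visits, v => v == 1);
        Console.WriteLine(once ? "each element is visited exactly once" : "coverage error");
    }
}
```

Because the stride equals the total number of threads, the loop stays correct whatever grid is chosen, including grids with more threads than elements.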

Grid configuration

To achieve high bandwidth, we need to properly configure the grid. Using enough blocks and threads increases occupancy and can mask latency by running concurrent blocks. We do that as we would in CUDA:

cudaDeviceProp prop;
cuda.GetDeviceProperties(out prop, 0);
HybRunner runner = HybRunner.Cuda("HelloWorld_CUDA.dll").SetDistrib(prop.multiProcessorCount * 16, 128);

Performance measurements

We can now compile this in Release|x64 and profile the execution with Nsight. We reach very high occupancy:

Hello world occupancy

We reach 337.8 GB/s on a GTX 1080 Ti (Pascal), which is 96% of the bandwidth test result on this GPU:

Hello World Bandwidth on 1080 Ti
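The text does not say how the bandwidth figure was computed. One common way to estimate it for this kernel is to count the bytes moved per element (a[i] is read, b[i] is read, a[i] is written) and divide by the kernel time. The sketch below assumes that counting and 1 GB = 1e9 bytes; the element count and timing in Main are hypothetical values, not measurements:

```csharp
using System;

static class BandwidthDemo
{
    // Effective bandwidth of a[i] += b[i] on doubles:
    // two reads and one write of 8 bytes per element.
    public static double EffectiveBandwidthGBs(long elementCount, double seconds)
    {
        double bytesMoved = 3.0 * sizeof(double) * elementCount;
        return bytesMoved / seconds / 1e9;
    }

    public static void Main()
    {
        // Hypothetical run: 100 million elements processed in 7.1 ms.
        double gbs = EffectiveBandwidthGBs(100000000L, 0.0071);
        Console.WriteLine(gbs.ToString("F1") + " GB/s");
    }
}
```

As a consistency check on the figures quoted above, 337.8 GB/s being 96% of the bandwidth test implies a bandwidth test result of about 352 GB/s.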

3. Hybridizer HOWTO – Intrinsics

It’s often useful to use intrinsics or builtin functions provided by CUDA. You might also already have a highly optimized CUDA header which you’d like to reuse from your C# application. This can be done using the IntrinsicFunction attribute. When generating the source code, the function call is replaced by the IntrinsicFunction name, and the body of that function is ignored.

Intrinsic functions

Consider this code sample:

class IntrinsicFunction
    {
        [IntrinsicFunction("printf")]
        public static void printf(string format, double val)
        {
            Console.WriteLine(val);
        }

        [IntrinsicFunction("erf")]
        private static double Erf(double x)
        {  
            double ax = x > 0.0 ? x : -x;
            const double a1 = 0.254829592;
            const double a2 = -0.284496736;
            const double a3 = 1.421413741;
            const double a4 = -1.453152027;
            const double a5 = 1.061405429;
            const double p = 0.3275911;
            double t = 1.0 / (1.0 + p * ax);
            double y = 1.0 - (((((a5 * t + a4) * t) + a3) * t + a2) * t + a1) * t * Math.Exp(-ax * ax);

            return x > 0.0 ? y : -y;
        }

        [EntryPoint]
        public static void run()
        {
            printf("%.17lf\n", Erf(1.0));
        }

        public static void Run()
        {
            Console.WriteLine("IntrinsicFunction :: ");
            Console.WriteLine(":: C# :: ");
            run();

            HybRunner runner = HybRunner.Cuda("ConsoleApplication96_CUDA.vs2015.dll").SetDistrib(1, 1);
            dynamic wrapped = runner.Wrap(new IntrinsicFunction());

            Console.WriteLine(":: CUDA :: ");
            cuda.DeviceSynchronize();
            wrapped.run();
        }
    }

The C# implementation of Erf is a simple polynomial approximation, so its accuracy is limited. It’s better to rely on the CUDA implementation.
To do that, we decorate the C# function with the IntrinsicFunction attribute. Hybridizer reads the name property of that attribute, and replaces calls to Erf with calls to the native erf function from cmath.
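To give an idea of the accuracy at stake, the sketch below evaluates the same polynomial approximation in plain C# (with t computed from |x|) and compares it at x = 1 with a reference value of erf(1) hard-coded to 16 digits. The coefficients are those of the Abramowitz and Stegun formula 7.1.26, whose documented maximum absolute error is about 1.5e-7. The class and method names are hypothetical:

```csharp
using System;

static class ErfAccuracyDemo
{
    // Polynomial approximation of erf, same coefficients as in the sample above.
    public static double ErfApprox(double x)
    {
        double ax = x > 0.0 ? x : -x;
        const double a1 = 0.254829592;
        const double a2 = -0.284496736;
        const double a3 = 1.421413741;
        const double a4 = -1.453152027;
        const double a5 = 1.061405429;
        const double p = 0.3275911;
        double t = 1.0 / (1.0 + p * ax);
        double y = 1.0 - (((((a5 * t + a4) * t) + a3) * t + a2) * t + a1) * t * Math.Exp(-ax * ax);
        return x > 0.0 ? y : -y;
    }

    public static void Main()
    {
        // Reference value of erf(1.0), rounded to double precision.
        const double reference = 0.8427007929497149;
        double approx = ErfApprox(1.0);
        Console.WriteLine("approximation  = " + approx.ToString("F17"));
        Console.WriteLine("reference      = " + reference.ToString("F17"));
        Console.WriteLine("absolute error = " + Math.Abs(approx - reference).ToString("E2"));
    }
}
```

An error of that size shows up from roughly the seventh decimal onwards, which is why the sample prints the result with 17 decimals: the C# and CUDA runs are expected to differ in the trailing digits.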

Intrinsic types

Sometimes, we already have a good native CUDA implementation of some function. In that case we don’t want Hybridizer to process our C# symbol. Instead, we want it to use the one we provide in a custom header.

[IntrinsicInclude("myheader.cuh")]
    class IntrinsicType
    {
        [IntrinsicFunction("myfunction")]
        private static double myfunction(double x)
        {
            return 42.0;
        }

        [IntrinsicFunction("printf")]
        public static void printf(string format, double val)
        {
            Console.WriteLine(val);
        }

        [EntryPoint]
        public static void run()
        {
            printf("%.17lf\n", myfunction(3.0));
        }

        public static void Run()
        {
            Console.WriteLine("IntrinsicType :: ");
            Console.WriteLine(":: C# :: ");
            run();

            HybRunner runner = HybRunner.Cuda("ConsoleApplication96_CUDA.vs2015.dll").SetDistrib(1, 1);
            dynamic wrapped = runner.Wrap(new IntrinsicType());

            Console.WriteLine(":: CUDA :: ");
            cuda.DeviceSynchronize();
            wrapped.run();
        }
    }

The IntrinsicInclude attribute on the class tells Hybridizer to include this header in the generated file.
We therefore write a custom header:

#pragma once
__device__ inline double myfunction(double x) {
	return x * x + 2.0;
}

The C# version of myfunction won’t be processed and myheader.cuh will be included.
With Hybridizer Software Suite, no further modification is needed. However, Hybridizer Essentials relies on nvrtc, which doesn’t include headers automatically. You then need to provide the path of the headers to Hybridizer, using the “Additional JITTER Headers” option:
Additional Jitter headers

Results

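You can download the above code from this archive. Running it will show that the C# code has been properly replaced by native functions. For instance, myfunction(3.0) returns 42.0 in C#, whereas the CUDA header computes 3.0 * 3.0 + 2.0 = 11.0: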

intrinsics calls

4. Hybridizer HOWTO – Libraries Integration

It is also possible to use/integrate existing libraries for which device functions are defined, hence extending the concept of intrinsic functions to intrinsic types.

Note that in that case, the functions do not need an implementation if no behavior is expected in plain C#.


[IntrinsicType("curandStateMRG32k3a_t")]
[IntrinsicIncludeCUDA("curand_kernel.h")]
[StructLayout(LayoutKind.Sequential)]
public unsafe struct curandStateMRG32k3a_t
{
    public fixed double s1[3];
    public fixed double s2[3];
    public int boxmuller_flag;
    public int boxmuller_flag_double;
    public float boxmuller_extra;
    public double boxmuller_extra_double;
    [IntrinsicFunction("curand_init")]
    public static void curand_init(ulong seed,
        ulong subsequence, ulong offset,
        out curandStateMRG32k3a_t state)
    { throw new NotImplementedException(); }
    [IntrinsicFunction("curand")]
    public uint curand()
    { throw new NotImplementedException(); }
    [IntrinsicFunction("curand_log_normal")]
    public float curand_log_normal(float mean, float stdev)
    { throw new NotImplementedException(); }
}
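As an illustration of how such a type might be used, here is a sketch of a kernel that fills an array with log-normal samples. It relies only on the members declared in the struct above; the class name, kernel name and parameters are hypothetical, and using the global thread index as the subsequence is an assumption made for the example:

```csharp
using Hybridizer.Runtime.CUDAImports;

public class RandomKernels
{
    // Hypothetical kernel: each thread initializes its own generator state,
    // then writes log-normal samples to the elements it owns.
    [EntryPoint]
    public static void FillLogNormal(float[] output, int N, ulong seed)
    {
        int tid = threadIdx.x + blockDim.x * blockIdx.x;

        // One independent subsequence per thread (assumption for this sketch).
        curandStateMRG32k3a_t state;
        curandStateMRG32k3a_t.curand_init(seed, (ulong)tid, 0UL, out state);

        for (int k = tid; k < N; k += blockDim.x * gridDim.x)
        {
            output[k] = state.curand_log_normal(0.0f, 1.0f);
        }
    }
}
```

Since the C# bodies throw NotImplementedException, this kernel is only meaningful when launched through the wrapper on the GPU. Calling it directly as plain C# would throw.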