CuSharp.CrossCompiler 1.0.0

.NET 7.0

dotnet add package CuSharp.CrossCompiler --version 1.0.0

NuGet\Install-Package CuSharp.CrossCompiler -Version 1.0.0

This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.

<PackageReference Include="CuSharp.CrossCompiler" Version="1.0.0" />

For projects that support PackageReference, copy this XML node into the project file to reference the package.

paket add CuSharp.CrossCompiler --version 1.0.0

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

#r "nuget: CuSharp.CrossCompiler, 1.0.0"

#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.

// Install CuSharp.CrossCompiler as a Cake Addin
#addin nuget:?package=CuSharp.CrossCompiler&version=1.0.0

// Install CuSharp.CrossCompiler as a Cake Tool
#tool nuget:?package=CuSharp.CrossCompiler&version=1.0.0

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

logo_crop

A GPU Compute Framework for .NET

The Thesis

This project was created as a Bachelors-thesis at the University of Applied Sciences of Eastern Switzerland (OST). The main-document of the thesis, describing this project in detail can be found here (UPDATEME).

Project Layout

CuSharp: All parts of the frontend of the framework.
CuSharp.AOTC: An executable, used to AOT-compile C#-methods to PTX-Kernels
CuSharp.CrossCompiler: The crosscompiler compiling MSIL-opcodes to PTX instructions
CuSharp.NVVMBinder: Bindings for libNVVM
CuSharp.PerformanceEvaluation: Examples used to evaluate the performance of the framework
CuSharp.Tests: Unit and integration tests used to test the functionality of the framework
CuSharp.MandelbrotExample: An example WPF application, generating mandelbrot-sets, using CuSharp

Nuget-Packages:

To be announced

Examples

Add two int arrays

[Kernel]
static void IntAdditionKernel (int[] a , int[] b , int[] result)
{
  int index = KernelTools.BlockIndex.X * KernelTools.BlockDimensions.X + KernelTools.ThreadIndex.X;
  result[index] = a[index] + b[index];
}

public void Launch()
{
  var device = Cu.GetDefaultDevice();

  var arrayA = new int [] {1 ,2 ,3};
  var arrayB = new int [] {4 ,5 ,6};
  var deviceArrayA = device.Copy(arrayA);
  var deviceArrayB = device.Copy(arrayB);
  var deviceResultArray = device.Allocate<int>(3);
  var gridDimensions = (1,1,1);
  var blockDimensions = (3,1,1);
  device.Launch(IntAdditionKernel, gridDimensions, blockDimensions, deviceArrayA, deviceArrayB, deviceResultArray);
  var arrayResult = device.Copy(deviceResultArray);
}

Matrix Multiplication Kernel

[Kernel]
public static void MatrixMultiplication<T>(T[] a, T[] b, T[] c, int matrixWidth) where T : INumber<T>, new()
{
    var row = KernelTools.BlockDimension.Y * KernelTools.BlockIndex.Y + KernelTools.ThreadIndex.Y;
    var col = KernelTools.BlockDimension.X * KernelTools.BlockIndex.X + KernelTools.ThreadIndex.X;
    T result = new T(); 
    if (row < matrixWidth && col < matrixWidth)
    {
        for (int i = 0; i < matrixWidth; i++)
        {
            //KernelTools.SyncThreads();
            result += a[matrixWidth * row + i] * b[i * matrixWidth + col];
        }

        c[row * matrixWidth + col] = result;
    }
}

Matrix Multiplication Kernel using Shared Memory

[Kernel(ArrayMemoryLocation.SHARED)]
public static void TiledIntMatrixMultiplication<T>(T[] a, T[] b, T[] c, int matrixWidth, int tileWidth, int nofTiles) where T : INumber<T>, new()
{
    var tx = KernelTools.ThreadIndex.X;
    var ty = KernelTools.ThreadIndex.Y;
    var col = KernelTools.BlockIndex.X * tileWidth + tx;
    var row = KernelTools.BlockIndex.Y * tileWidth + ty;

    var aSub = new T[1024];
    var bSub = new T[1024];

    T sum = new T();
    for (int tile = 0; tile < nofTiles; tile++)
    {
        if (row < matrixWidth && tile * tileWidth + tx < matrixWidth)
        {
            aSub[ty * tileWidth + tx] = a[row * matrixWidth + tile * tileWidth + tx];
        }

        if (col < matrixWidth && tile * tileWidth + ty < matrixWidth)
        {
            bSub[ty * tileWidth + tx] = b[(tile * tileWidth + ty) * matrixWidth + col];
        }

        KernelTools.SyncThreads();

        if (row < matrixWidth && col < matrixWidth)
        {
            for (int ksub = 0; ksub < tileWidth; ksub++)
            {
                if (tile * tileWidth + ksub < matrixWidth)
                {
                    sum += aSub[ty * tileWidth + ksub] * bSub[ksub * tileWidth + tx];
                }
            }
        }
        KernelTools.SyncThreads();
    }
    if (row < matrixWidth && col < matrixWidth)
    {
        c[row * matrixWidth + col] = sum;
    }
}

Complete Examples

More complete examples can be found in the following project directories:

CuSharp.MandelbrotExample: A WPF-Project visualizing Mandelbrot-sets using CuSharp
CuSharp.PerformanceEvaluation: A console-application measuring the performance of matrix-multiplications

API

Static Class: Cu

Properties

bool EnableOptimizer: Enables or disables the built-in optimizer. Default: True (in Debug mode), False (in Release mode).
string AotKernelFolder: Specifies the folder where the framework should look for kernels that were ahead-of-time compiled.

Static Methods

IEnumerator<(int, string)> GetDeviceList(): Returns a list of pairs of device-id and device-name.
CuDevice GetDefaultDevice(): Returns a handle for the device with ID: 0.
CuDevice GetDeviceById(int deviceId): Returns a for the device with ID: deviceId.
CuEvent CreateEvent(): Returns a handle to a Cuda-Event used to measure performance.

Class: CuDevice

Implements IDisposable

Methods

string ToString(): Returns the devices name.
void Synchronize(): Blocks until all tasks on the device are finished.
Tensor<T[]> Allocate<T>(int size): Allocates an array of size elements on the device and returns its handle.
Tensor<T[,]> Allocate<T>(int sizeX, int sizeY): allocates a 2D-array of size sizeX * sizeY on the device and returns its handle.
Tensor<T[]> Copy<T>(T[] hostTensor): Copies hostTensor to the device and returns a handle to the copied array.
Tensor<T[,]> Copy<T>(T[,] hostTensor): Copies hostTensor to the device and returns a handle to the copied array.
Tensor<T> CreateScalar<T>(T hostScalar): Copies hostScalar to the device and returns a handle to the copied value.
T[] Copy<T>(Tensor<T[]> deviceTensor): Copies deviceTensor from the device and returns the array.
T[,] Copy<T>(Tensor<T[,]> deviceTensor): Copies deviceTensor from the device and returns the 2D-array.
void Launch<T1, ..., TN>(Action<T1, ..., TN> kernel, (uint,uint,uint) gridDimensions, (uint,uint,uint) blockDimensions, Tensor<T1> param1, ... , Tensor<TN> paramN): JIT-compiles (if needed) and launches kernel on the device with the specified dimensions and Tensor<T>-parameters.
void Dispose(): Disposes all allocated ressources of the device-handle.

Interface: ICuEvent

Implements IDisposable

Methods

void Record(): Records the point in time this method-was called relative to the GPU-Runtime.
float GetDeltaTo(CuEvent event): Returns the time delta between this CuEvent and event.
void Dispose(): Disposes all allocated ressources of the event-handle.

Static Class: KernelTools

A class to be used inside the kernel to access GPU-capabilities.
The properties below are compiled to NVVM intrinsic functions. The properties all point to a corresponding functor that by default throws an exception. The corresponding functors can be overriden to repurpose the KernelTools Properties.

Properties (to be used only inside kernels)

(uint X, uint Y, uint Z) GridDimension: Returns the grid dimensions of the current kernel launch.
(uint X, uint Y, uint Z) BlockDimension: Returns the block dimension of the current kernel launch.
(uint X, uint Y, uint Z) BlockIndex: Returns the block index inside the grid.
(uint X, uint Y, uint Z) ThreadIndex: Returns the thread index relative to the threads block.
uint WarpSize: Returns the warpsize of the executing device.
Action SyncThreads: Waits until all threads inside the current block reach this point when called.
Action GlobalThreadFence: Halts until all writes to global and shared memory of the current thread are visible to other threads when called.
Action SystemThreadFence: Halts until all writes (system wide) of the current threaad are visible to other threads when called.

Dependencies

Product	Compatible and additional computed target framework versions.
.NET	net7.0 is compatible. net7.0-android was computed. net7.0-ios was computed. net7.0-maccatalyst was computed. net7.0-macos was computed. net7.0-tvos was computed. net7.0-windows was computed. net8.0 was computed. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed.

Product

.NET

Compatible target framework(s)

Included target framework(s) (in package)

Learn more about Target Frameworks and .NET Standard.

net7.0
- CuSharp.NVVMBinder (>= 1.0.0)
- LLVMSharp (>= 5.0.0)

NuGet packages (1)

Showing the top 1 NuGet packages that depend on CuSharp.CrossCompiler:

Package	Downloads
CuSharp Package Description	147

GitHub repositories

This package is not used by any popular GitHub repositories.

Version	Downloads	Last updated
1.0.0	153	6/14/2023

Total 153

Current version 153

Per day average 0

CuSharp.CrossCompiler 1.0.0

The Thesis

Project Layout

Nuget-Packages:

Examples

Add two int arrays

Matrix Multiplication Kernel

Matrix Multiplication Kernel using Shared Memory

Complete Examples

API

Static Class: Cu

Properties

Static Methods

Class: CuDevice

Methods

Interface: ICuEvent

Methods

Static Class: KernelTools

Properties (to be used only inside kernels)

Dependencies

net7.0

NuGet packages (1)

GitHub repositories