CuSharp.CrossCompiler 1.0.0

Installation

  • .NET CLI: dotnet add package CuSharp.CrossCompiler --version 1.0.0
  • Package Manager Console in Visual Studio (uses the NuGet module's version of Install-Package): NuGet\Install-Package CuSharp.CrossCompiler -Version 1.0.0
  • PackageReference (for projects that support it, copy this XML node into the project file): <PackageReference Include="CuSharp.CrossCompiler" Version="1.0.0" />
  • Paket CLI: paket add CuSharp.CrossCompiler --version 1.0.0
  • F# Interactive and Polyglot Notebooks (copy the #r directive into the interactive tool or the script source): #r "nuget: CuSharp.CrossCompiler, 1.0.0"
  • Cake Addin: #addin nuget:?package=CuSharp.CrossCompiler&version=1.0.0
  • Cake Tool: #tool nuget:?package=CuSharp.CrossCompiler&version=1.0.0
A GPU Compute Framework for .NET

The Thesis

This project was created as a bachelor's thesis at the University of Applied Sciences of Eastern Switzerland (OST). The main document of the thesis, which describes the project in detail, can be found here (UPDATEME).

Project Layout

  • CuSharp: All parts of the frontend of the framework.
  • CuSharp.AOTC: An executable used to AOT-compile C# methods to PTX kernels
  • CuSharp.CrossCompiler: The cross-compiler that compiles MSIL opcodes to PTX instructions
  • CuSharp.NVVMBinder: Bindings for libNVVM
  • CuSharp.PerformanceEvaluation: Examples used to evaluate the performance of the framework
  • CuSharp.Tests: Unit and integration tests used to test the functionality of the framework
  • CuSharp.MandelbrotExample: An example WPF application that generates Mandelbrot sets using CuSharp

NuGet packages:

  • To be announced

Examples

Add two int arrays

[Kernel]
static void IntAdditionKernel(int[] a, int[] b, int[] result)
{
  // Global index of this thread across all blocks of the launch
  var index = KernelTools.BlockIndex.X * KernelTools.BlockDimension.X + KernelTools.ThreadIndex.X;
  result[index] = a[index] + b[index];
}

public void Launch()
{
  var device = Cu.GetDefaultDevice();

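  var arrayA = new int[] {1, 2, 3};
  var arrayB = new int[] {4, 5, 6};
  var deviceArrayA = device.Copy(arrayA);
  var deviceArrayB = device.Copy(arrayB);
  var deviceResultArray = device.Allocate<int>(3);
  // Launch expects (uint,uint,uint) tuples, so the dimensions are typed explicitly
  (uint, uint, uint) gridDimensions = (1, 1, 1);   // one block
  (uint, uint, uint) blockDimensions = (3, 1, 1);  // three threads, one per array element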
  device.Launch(IntAdditionKernel, gridDimensions, blockDimensions, deviceArrayA, deviceArrayB, deviceResultArray);
  var arrayResult = device.Copy(deviceResultArray);
}
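
To make the index arithmetic in the kernel above easier to follow, here is a minimal CPU-only sketch that replays the same launch (one block of three threads) with plain loops. GlobalIndex is a hypothetical helper introduced for illustration and is not part of CuSharp; the sketch assumes only the formula shown in the kernel.

```csharp
using System;

// Hypothetical helper (not part of CuSharp): the same formula the kernel
// uses to turn block index, block dimension and thread index into one index.
static uint GlobalIndex(uint blockIndex, uint blockDimension, uint threadIndex)
    => blockIndex * blockDimension + threadIndex;

int[] a = { 1, 2, 3 };
int[] b = { 4, 5, 6 };
int[] result = new int[3];

const uint gridDimX = 1;   // one block, as in the launch above
const uint blockDimX = 3;  // three threads per block

// On the GPU every (block, thread) pair runs in parallel; here they run in sequence.
for (uint block = 0; block < gridDimX; block++)
{
    for (uint thread = 0; thread < blockDimX; thread++)
    {
        uint index = GlobalIndex(block, blockDimX, thread);
        result[index] = a[index] + b[index];
    }
}

Console.WriteLine(string.Join(", ", result)); // prints "5, 7, 9"
```

With more than one block, the formula keeps indices unique: thread 1 of block 2 with a block dimension of 4 maps to index 9.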

Matrix Multiplication Kernel

[Kernel]
public static void MatrixMultiplication<T>(T[] a, T[] b, T[] c, int matrixWidth) where T : INumber<T>, new()
{
    var row = KernelTools.BlockDimension.Y * KernelTools.BlockIndex.Y + KernelTools.ThreadIndex.Y;
    var col = KernelTools.BlockDimension.X * KernelTools.BlockIndex.X + KernelTools.ThreadIndex.X;
    T result = new T(); 
    if (row < matrixWidth && col < matrixWidth)
    {
        // Dot product of row `row` of a and column `col` of b (row-major layout)
        for (int i = 0; i < matrixWidth; i++)
        {
            result += a[matrixWidth * row + i] * b[i * matrixWidth + col];
        }

        c[row * matrixWidth + col] = result;
    }
}
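
One way to check the output of this kernel is to compare it against a plain CPU implementation. The following is a minimal sketch under that assumption; MultiplyOnCpu is a hypothetical helper, not part of CuSharp, and it mirrors the kernel's row-major indexing for square matrices.

```csharp
using System;
using System.Numerics;

// Hypothetical CPU reference (not part of CuSharp): multiplies two square,
// row-major matrices with the same indexing the kernel uses.
static T[] MultiplyOnCpu<T>(T[] a, T[] b, int matrixWidth) where T : INumber<T>
{
    var c = new T[matrixWidth * matrixWidth];
    for (int row = 0; row < matrixWidth; row++)
    {
        for (int col = 0; col < matrixWidth; col++)
        {
            T sum = T.Zero;
            for (int i = 0; i < matrixWidth; i++)
            {
                sum += a[matrixWidth * row + i] * b[i * matrixWidth + col];
            }
            c[row * matrixWidth + col] = sum;
        }
    }
    return c;
}

int[] a = { 1, 2, 3, 4 };  // [[1, 2], [3, 4]]
int[] b = { 5, 6, 7, 8 };  // [[5, 6], [7, 8]]
int[] expected = MultiplyOnCpu(a, b, 2);

Console.WriteLine(string.Join(", ", expected)); // prints "19, 22, 43, 50"
```

The array returned here can be compared element by element with the array copied back from the device after the kernel has run.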

Matrix Multiplication Kernel using Shared Memory

[Kernel(ArrayMemoryLocation.SHARED)]
public static void TiledIntMatrixMultiplication<T>(T[] a, T[] b, T[] c, int matrixWidth, int tileWidth, int nofTiles) where T : INumber<T>, new()
{
    var tx = KernelTools.ThreadIndex.X;
    var ty = KernelTools.ThreadIndex.Y;
    var col = KernelTools.BlockIndex.X * tileWidth + tx;
    var row = KernelTools.BlockIndex.Y * tileWidth + ty;

    var aSub = new T[1024];
    var bSub = new T[1024];

    T sum = new T();
    for (int tile = 0; tile < nofTiles; tile++)
    {
        if (row < matrixWidth && tile * tileWidth + tx < matrixWidth)
        {
            aSub[ty * tileWidth + tx] = a[row * matrixWidth + tile * tileWidth + tx];
        }

        if (col < matrixWidth && tile * tileWidth + ty < matrixWidth)
        {
            bSub[ty * tileWidth + tx] = b[(tile * tileWidth + ty) * matrixWidth + col];
        }

        KernelTools.SyncThreads();

        if (row < matrixWidth && col < matrixWidth)
        {
            for (int ksub = 0; ksub < tileWidth; ksub++)
            {
                if (tile * tileWidth + ksub < matrixWidth)
                {
                    sum += aSub[ty * tileWidth + ksub] * bSub[ksub * tileWidth + tx];
                }
            }
        }
        KernelTools.SyncThreads();
    }
    if (row < matrixWidth && col < matrixWidth)
    {
        c[row * matrixWidth + col] = sum;
    }
}
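
The tiled kernel takes tileWidth and nofTiles as parameters, and its shared arrays hold 1024 elements that are indexed up to tileWidth * tileWidth - 1. The sketch below shows one plausible way to derive matching launch parameters on the host. TiledLaunchConfig is a hypothetical helper, not part of CuSharp, and the sketch assumes that each block is launched with tileWidth x tileWidth threads, which is what the kernel's use of ThreadIndex suggests.

```csharp
using System;

// Hypothetical helper (not part of CuSharp): derives the grid side length and
// the number of tiles for a square matrix of the given width.
static (uint gridSide, int nofTiles) TiledLaunchConfig(int matrixWidth, int tileWidth)
{
    // aSub and bSub hold 1024 elements each, so tileWidth * tileWidth
    // must not exceed 1024, i.e. tileWidth must be at most 32.
    if (tileWidth < 1 || tileWidth > 32)
        throw new ArgumentOutOfRangeException(nameof(tileWidth));

    // Ceiling division: a partially filled tile at the edge still needs a block.
    int nofTiles = (matrixWidth + tileWidth - 1) / tileWidth;
    return ((uint)nofTiles, nofTiles);
}

var (gridSide, nofTiles) = TiledLaunchConfig(matrixWidth: 100, tileWidth: 32);

// Assumed launch shape: one block per output tile, one thread per tile element.
(uint, uint, uint) gridDimensions = (gridSide, gridSide, 1);
(uint, uint, uint) blockDimensions = (32, 32, 1);

Console.WriteLine($"{gridDimensions} {blockDimensions} nofTiles={nofTiles}");
// prints "(4, 4, 1) (32, 32, 1) nofTiles=4"
```

The bounds checks inside the kernel handle the case where matrixWidth is not a multiple of tileWidth, as with 100 and 32 here.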

Complete Examples

More complete examples can be found in the following project directories:

  • CuSharp.MandelbrotExample: A WPF project that visualizes Mandelbrot sets using CuSharp
  • CuSharp.PerformanceEvaluation: A console application that measures the performance of matrix multiplications

API

Static Class: Cu

Properties

  • bool EnableOptimizer: Enables or disables the built-in optimizer. Default: True (in Debug mode), False (in Release mode).
  • string AotKernelFolder: Specifies the folder where the framework should look for kernels that were ahead-of-time compiled.

Static Methods

  • IEnumerator<(int, string)> GetDeviceList(): Returns an enumerator over pairs of device ID and device name.
  • CuDevice GetDefaultDevice(): Returns a handle for the device with ID 0.
  • CuDevice GetDeviceById(int deviceId): Returns a handle for the device with the ID deviceId.
  • CuEvent CreateEvent(): Returns a handle to a CUDA event used to measure performance.
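
Because GetDeviceList() is documented as returning an IEnumerator rather than an IEnumerable, its result cannot be used directly in a foreach loop. The sketch below shows one way to drain such an enumerator. DescribeDevices is a hypothetical helper, not part of CuSharp, and the device names are stand-in data so that the sketch runs without a GPU.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of CuSharp): walks an enumerator of
// (device ID, device name) pairs and formats one line per device.
static List<string> DescribeDevices(IEnumerator<(int, string)> devices)
{
    var lines = new List<string>();
    using (devices) // IEnumerator<T> is IDisposable
    {
        while (devices.MoveNext())
        {
            var (id, name) = devices.Current;
            lines.Add($"{id}: {name}");
        }
    }
    return lines;
}

// Stand-in data; with CuSharp one would pass Cu.GetDeviceList() instead.
IEnumerable<(int, string)> fakeDevices = new List<(int, string)>
{
    (0, "Example GPU A"),
    (1, "Example GPU B"),
};

List<string> described = DescribeDevices(fakeDevices.GetEnumerator());
Console.WriteLine(string.Join(Environment.NewLine, described));
```

An ID obtained this way can then be passed to GetDeviceById to get a device handle.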

Class: CuDevice

  • Implements IDisposable

Methods

  • string ToString(): Returns the device's name.
  • void Synchronize(): Blocks until all tasks on the device are finished.
  • Tensor<T[]> Allocate<T>(int size): Allocates an array of size elements on the device and returns its handle.
  • Tensor<T[,]> Allocate<T>(int sizeX, int sizeY): Allocates a 2D-array of sizeX * sizeY elements on the device and returns its handle.
  • Tensor<T[]> Copy<T>(T[] hostTensor): Copies hostTensor to the device and returns a handle to the copied array.
  • Tensor<T[,]> Copy<T>(T[,] hostTensor): Copies hostTensor to the device and returns a handle to the copied array.
  • Tensor<T> CreateScalar<T>(T hostScalar): Copies hostScalar to the device and returns a handle to the copied value.
  • T[] Copy<T>(Tensor<T[]> deviceTensor): Copies deviceTensor from the device and returns the array.
  • T[,] Copy<T>(Tensor<T[,]> deviceTensor): Copies deviceTensor from the device and returns the 2D-array.
  • void Launch<T1, ..., TN>(Action<T1, ..., TN> kernel, (uint,uint,uint) gridDimensions, (uint,uint,uint) blockDimensions, Tensor<T1> param1, ... , Tensor<TN> paramN): JIT-compiles (if needed) and launches kernel on the device with the specified dimensions and Tensor<T>-parameters.
  • void Dispose(): Disposes all allocated resources of the device-handle.

Interface: ICuEvent

  • Implements IDisposable

Methods

  • void Record(): Records the point in time at which this method was called, relative to the GPU runtime.
  • float GetDeltaTo(CuEvent event): Returns the time delta between this CuEvent and event.
  • void Dispose(): Disposes all allocated resources of the event-handle.

Static Class: KernelTools

  • A class to be used inside the kernel to access GPU-capabilities.
  • The properties below are compiled to NVVM intrinsic functions. Each property points to a corresponding functor that throws an exception by default. These functors can be overridden to repurpose the KernelTools properties.

Properties (to be used only inside kernels)

  • (uint X, uint Y, uint Z) GridDimension: Returns the grid dimensions of the current kernel launch.
  • (uint X, uint Y, uint Z) BlockDimension: Returns the block dimension of the current kernel launch.
  • (uint X, uint Y, uint Z) BlockIndex: Returns the block index inside the grid.
  • (uint X, uint Y, uint Z) ThreadIndex: Returns the thread index relative to the thread's block.
  • uint WarpSize: Returns the warp size of the executing device.
  • Action SyncThreads: Waits until all threads inside the current block reach this point when called.
  • Action GlobalThreadFence: Halts until all writes to global and shared memory of the current thread are visible to other threads when called.
  • Action SystemThreadFence: Halts until all writes (system wide) of the current thread are visible to other threads when called.

Dependencies

Compatible target framework: net7.0.
Additional computed target frameworks: net7.0-android, net7.0-ios, net7.0-maccatalyst, net7.0-macos, net7.0-tvos, net7.0-windows, net8.0, net8.0-android, net8.0-browser, net8.0-ios, net8.0-maccatalyst, net8.0-macos, net8.0-tvos, net8.0-windows.

NuGet packages (1)

Showing the top 1 NuGet packages that depend on CuSharp.CrossCompiler:

  • CuSharp

GitHub repositories

This package is not used by any popular GitHub repositories.

Version Downloads Last updated
1.0.0 153 6/14/2023