CUDA executor
gko::CudaExecutor runs Ginkgo kernels on a single NVIDIA GPU using the CUDA runtime.
Construction
The simplest form takes a device id and a host-side master:
#include <ginkgo/ginkgo.hpp>
auto host = gko::OmpExecutor::create();         // host-side master executor
auto gpu = gko::CudaExecutor::create(0, host);  // device id 0
For more control, supply a custom allocator and a CUDA stream:
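// Reuses `host` from the previous snippet; needs <cuda_runtime.h> and <memory>.
cudaStream_t stream;
cudaStreamCreate(&stream);  // returns a cudaError_t; check it in production code
// Stream-ordered allocator bound to the same stream the executor will use.
auto alloc = std::make_shared<gko::CudaAsyncAllocator>(stream);
auto gpu = gko::CudaExecutor::create(0, host, alloc, stream);
// Call cudaStreamDestroy(stream) only once `gpu` is no longer in use.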
Allocators
CudaExecutor supports several allocators (see Allocators for the full table):
CudaAllocator: cudaMalloc (default, synchronous).
CudaAsyncAllocator: cudaMallocAsync (stream-ordered, lower overhead in hot loops).
CudaUnifiedAllocator: cudaMallocManaged (CUDA unified memory, also called managed memory).
CudaHostAllocator: cudaHostAlloc (pinned host memory).
If you don’t pass an allocator argument, CudaExecutor::create(id, host) constructs a CudaAllocator (synchronous cudaMalloc).
Streams
By default, CudaExecutor uses the legacy default stream. The stream parameter has type CUstream_st*, which is exactly the pointer type that CUDA's cudaStream_t typedef names, so the two are interchangeable. Ginkgo does not wrap streams in its own type; pass whatever you got from cudaStreamCreate(). Use an explicit stream to overlap kernel execution with host work or with work on other streams; this is the foundation for multi-stream pipelines and CUDA-graph integration.
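The type equivalence can be checked at compile time. The sketch below is only an illustration: it repeats the two declarations from CUDA's headers so that it builds without the toolkit (real code would include cuda_runtime.h instead), and is_null_stream is a hypothetical stand-in for any function whose stream parameter is spelled CUstream_st*; it is not a Ginkgo function.

```cpp
#include <type_traits>

// Mirrors the declarations in CUDA's headers so this sketch builds without
// the CUDA toolkit; real code includes <cuda_runtime.h> instead.
struct CUstream_st;
typedef CUstream_st* cudaStream_t;

// Both spellings name one and the same type, so no cast is ever needed.
static_assert(std::is_same<cudaStream_t, CUstream_st*>::value,
              "cudaStream_t is an alias for CUstream_st*");

// Hypothetical stand-in (not a Ginkgo function) for an API whose stream
// parameter is declared as CUstream_st*.
inline bool is_null_stream(CUstream_st* stream)
{
    // A null handle is how the CUDA runtime denotes the default stream.
    return stream == nullptr;
}
```

A variable declared as cudaStream_t can be handed to is_null_stream directly, which is the same reason a stream obtained from cudaStreamCreate() can be passed straight to CudaExecutor::create.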
Memory model
Device memory is opaque to host code by default. Choosing CudaUnifiedAllocator or CudaHostAllocator produces pointers that are also valid on the host master, in which case gpu->memory_accessible(host) returns true and Ginkgo elides redundant copies between them.
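As a compact summary of the allocator choices and their effect on host accessibility, the following sketch encodes what this page states as a small lookup. The enum and both helper functions are hypothetical names used for illustration only; they are not Ginkgo types, and the authoritative answer at run time remains gpu->memory_accessible(host).

```cpp
#include <string>

// Hypothetical labels for the four allocators described on this page;
// illustration only, not a Ginkgo type.
enum class AllocKind { device, async, unified, pinned_host };

// CUDA runtime call behind each allocator, as listed under Allocators.
inline std::string backing_call(AllocKind kind)
{
    switch (kind) {
    case AllocKind::device:
        return "cudaMalloc";
    case AllocKind::async:
        return "cudaMallocAsync";
    case AllocKind::unified:
        return "cudaMallocManaged";
    case AllocKind::pinned_host:
        return "cudaHostAlloc";
    }
    return "";  // unreachable for valid enumerators
}

// Whether the allocator yields pointers that are also valid on the host
// master, following the description in this section.
inline bool host_accessible(AllocKind kind)
{
    return kind == AllocKind::unified || kind == AllocKind::pinned_host;
}
```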
Discovering devices
auto count = gko::CudaExecutor::get_num_devices();  // number of CUDA devices visible to this process
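One common use of the device count is to spread several processes over the available GPUs. The helper below is a minimal sketch of a round-robin mapping; select_device_id is a hypothetical name and not part of Ginkgo, and the local rank is assumed to come from the job launcher or MPI.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical helper (not part of Ginkgo): round-robin mapping from a
// process-local rank to a device id in the range [0, num_devices).
inline int select_device_id(int local_rank, int num_devices)
{
    assert(local_rank >= 0);  // ranks are assumed to be non-negative
    if (num_devices <= 0) {
        throw std::runtime_error("no CUDA device available");
    }
    return local_rank % num_devices;
}
```

With Ginkgo, the result would be passed on as gko::CudaExecutor::create(select_device_id(rank, gko::CudaExecutor::get_num_devices()), host), where rank is the assumed process-local rank.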
See also
API reference:
gko::CudaExecutor