Switch executor
You have a working Ginkgo program on one executor and want to run the same program on a different one — typically CPU → GPU. In Ginkgo, the executor is a runtime argument: switch the constructor and everything downstream follows.
The change
The only line that differs between backends is the executor constructor.
The rest of the program — matrix construction, solver factory, apply —
is identical.
// CPU, single-threaded reference baseline:
auto exec = gko::ReferenceExecutor::create();
// CPU, OpenMP-parallel:
auto exec = gko::OmpExecutor::create();
// NVIDIA GPU (CUDA). Pair with a host executor for staging.
auto exec = gko::CudaExecutor::create(0, gko::OmpExecutor::create());
// AMD GPU (HIP):
auto exec = gko::HipExecutor::create(0, gko::OmpExecutor::create());
// Intel GPU (SYCL):
auto exec = gko::DpcppExecutor::create(0, gko::OmpExecutor::create());
The first argument to GPU executors is the device id; the second is the
host executor that the GPU executor uses for host-side allocations
and copy-staging. Pair the GPU executor with OmpExecutor (or
ReferenceExecutor if OpenMP is not built).
Picking the device
For multi-GPU nodes, query the count and pick by rank or environment:
// HipExecutor offers the same call; DpcppExecutor::get_num_devices takes a device-type string such as "gpu".
int n = gko::CudaExecutor::get_num_devices();
auto exec = gko::CudaExecutor::create(rank % n, gko::OmpExecutor::create());
For MPI ranks on multi-GPU nodes, the helper
gko::experimental::mpi::map_rank_to_device_id(comm, n) gives each rank a
device id based on its node-local rank.
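The `rank % n` expression is undefined when no device is visible (`n == 0`), so it can be worth wrapping in a small guard. This is a sketch only; `pick_device_id` is a hypothetical helper, not a Ginkgo function, and it assumes the caller already knows its node-local rank.

```cpp
#include <stdexcept>

// Hypothetical helper (not a Ginkgo API): round-robin mapping of a
// node-local rank onto the devices visible on that node.
int pick_device_id(int local_rank, int num_devices)
{
    if (num_devices <= 0) {
        // Avoids the modulo-by-zero that a bare `rank % n` would hit.
        throw std::runtime_error("no devices visible on this node");
    }
    if (local_rank < 0) {
        throw std::invalid_argument("local_rank must be non-negative");
    }
    return local_rank % num_devices;
}
```

It would be used as `gko::CudaExecutor::create(pick_device_id(local_rank, n), host)`, where `n` comes from `get_num_devices()` and `host` is the host executor. With more ranks than devices, several ranks share a device.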
What happens to existing data
Switching the executor in main() is enough: every LinOp, Dense, and
array subsequently constructed with exec lives on the new executor. If you
already have host-resident matrices and want to copy them across, use
gko::clone(target_exec, obj) or the type’s own clone(target_exec):
auto A_host = gko::matrix::Csr<>::create(host_exec);
A_host->read(host_matrix_data);
auto A_dev = gko::clone(exec, A_host); // deep copy on exec; same concrete type (Csr)
For zero-copy from application-owned buffers see Zero-copy from application memory.
Common pitfalls
Operands on different executors. apply(b, x) requires b, x, and the operator to share the same executor. If they don’t, the call silently clones to align them, which is correct but copies data on each call. Construct everything on the target executor from the start.
at(i, j) is host-only. Dereferencing Dense::at on a vector whose executor is a GPU is undefined behaviour. clone to the host executor first when you need to read or write individual entries.
Share the host master across GPU executors. Each OmpExecutor allocates its own thread pool at construction. If you build several GPU executors (e.g. one per MPI rank’s local device), construct one OmpExecutor and pass the same shared_ptr to each GPU executor rather than calling gko::OmpExecutor::create() inline at every GPU-executor construction site.
See also
Move data between executors: clone vs copy_from.
Executor model: the conceptual reference.