Poisson on GPU#
This tutorial picks up from Poisson on CPU and ports the solver to a GPU executor. We’re solving the same 1D Poisson problem — \(-u''(x) = -6x\) on \((0, 1)\) with \(u(0)=0,\ u(1)=1\) — but the matrix and the iteration now live on the device.
The whole port is three line changes plus one new accessor pattern. The point of this tutorial is to show that Ginkgo’s executor abstraction means “GPU-aware” code looks almost identical to CPU code.
Prerequisites#
You need Ginkgo built with a GPU backend. Pick the one that matches your hardware:
Hardware |
CMake flag |
Executor class |
|---|---|---|
NVIDIA |
|
|
AMD |
|
|
Intel GPU |
|
|
The walk-through below uses CudaExecutor. To target AMD or Intel, swap the
executor type — every other line of code stays the same.
This tutorial also assumes you have the OpenMP backend (GINKGO_BUILD_OMP=ON)
so the host side can use multi-threaded execution for data assembly. The
ReferenceExecutor works too if OMP isn’t available; just substitute it
below.
See Install Ginkgo for the full build invocation.
The host/device executor pattern#
GPU code in Ginkgo uses two executors: a host executor for data assembly and a device executor for the heavy lifting:
auto host_exec = gko::OmpExecutor::create();
auto exec = gko::CudaExecutor::create(/*device_id=*/0, host_exec);
The CUDA executor needs a host executor as its second constructor argument — this is the executor used for fall-back operations and for asynchronous device-to-host transfers that the runtime issues internally. See CudaExecutor for the full constructor signatures (allocator and stream variants).
Three line changes from the CPU version#
Take the CPU poisson_cpu.cpp and apply these edits:
Create a device executor (and rename the CPU one):
auto host_exec = gko::OmpExecutor::create(); auto exec = gko::CudaExecutor::create(0, host_exec);
Build matrix and vectors on the host, then clone the matrix to the device when handing it to
generate. The right-hand side and solution vector are built on the host as well and transferred implicitly byapply:auto matrix = build_matrix(host_exec, K); auto rhs = build_rhs(host_exec, K, /*u0=*/0.0, /*u1=*/1.0); auto u = Dense::create(host_exec, gko::dim<2>(K, 1)); u->fill(0.0); auto solver = solver_factory->generate(gko::share(gko::clone(exec, matrix))); solver->apply(rhs, u); // rhs and u are pulled onto the device
gko::clone(exec, matrix)is the canonical cross-executor copy — it returns aunique_ptr<Csr>whose storage lives onexec’s device memory.gko::sharethen hands ownership over to the factory.Build the solver factory on the device executor:
auto solver_factory = Cg::build() .with_criteria(/* same as CPU */) .with_preconditioner(Jacobi::build()) .on(exec); // <-- device executor, was 'host_exec' before
That’s it. The math, the CG iteration, the preconditioner, the stopping
criteria — none of that needs to change. The kernels Ginkgo dispatches under
the hood swap automatically based on what exec is.
Reading the result back#
The CPU tutorial reads u with u->get_const_values() directly. That’s
fine when u lives on the host. In our GPU version u was built on
host_exec, but after apply its values may live on the device —
specifically, apply performs a cross-executor copy of the solution back
into u’s host storage if u is on the host, so in this exact pattern the
direct read works. Where the issue bites is if you build the result vector
on exec instead:
auto u = Dense::create(exec, gko::dim<2>(K, 1)); // device-resident
// ...
const auto* uv = u->get_const_values(); // <-- DEVICE pointer
double bad = uv[0]; // <-- segfault on host
The robust pattern is to clone to the host before reading:
auto u_host = gko::clone(host_exec, u); // device -> host copy
const auto* uv = u_host->get_const_values();
double good = uv[0]; // ok, host memory
For the tutorial we keep u on the host throughout, so the post-solve loop
is identical to the CPU version. The Cross-executor data movement
how-to covers the topic in
depth.
Full program#
poisson_gpu.cpp:
#include <cmath>
#include <fstream>
#include <iostream>
#include <vector>
#include <ginkgo/ginkgo.hpp>
using ValueType = double;
using IndexType = int;
using Csr = gko::matrix::Csr<ValueType, IndexType>;
using Dense = gko::matrix::Dense<ValueType>;
using Cg = gko::solver::Cg<ValueType>;
using Jacobi = gko::preconditioner::Jacobi<ValueType, IndexType>;
std::unique_ptr<Csr> build_matrix(std::shared_ptr<const gko::Executor> exec,
int K)
{
auto mat = Csr::create(exec, gko::dim<2>(K, K), 3 * K - 2);
auto row_ptrs = mat->get_row_ptrs();
auto col_idxs = mat->get_col_idxs();
auto values = mat->get_values();
int pos = 0;
row_ptrs[0] = 0;
for (int i = 0; i < K; ++i) {
if (i > 0) { col_idxs[pos] = i - 1; values[pos++] = -1.0; }
col_idxs[pos] = i; values[pos++] = 2.0;
if (i < K - 1) { col_idxs[pos] = i + 1; values[pos++] = -1.0; }
row_ptrs[i + 1] = pos;
}
return mat;
}
std::unique_ptr<Dense> build_rhs(std::shared_ptr<const gko::Executor> exec,
int K, double u0, double u1)
{
auto rhs = Dense::create(exec, gko::dim<2>(K, 1));
auto vals = rhs->get_values();
const double h = 1.0 / (K + 1);
for (int i = 0; i < K; ++i) {
const double xi = (i + 1) * h;
vals[i] = -6.0 * xi * h * h;
}
vals[0] += u0;
vals[K - 1] += u1;
return rhs;
}
int main(int argc, char* argv[])
{
const int K = (argc >= 2) ? std::atoi(argv[1]) : 1024;
auto host_exec = gko::OmpExecutor::create();
auto exec = gko::CudaExecutor::create(0, host_exec);
auto solver_factory =
Cg::build()
.with_criteria(
gko::stop::Iteration::build().with_max_iters(4u * K),
gko::stop::ResidualNorm<ValueType>::build()
.with_reduction_factor(1e-12))
.with_preconditioner(Jacobi::build())
.on(exec);
auto matrix = build_matrix(host_exec, K);
auto rhs = build_rhs(host_exec, K, /*u0=*/0.0, /*u1=*/1.0);
auto u = Dense::create(host_exec, gko::dim<2>(K, 1));
u->fill(0.0);
auto solver = solver_factory->generate(gko::share(gko::clone(exec, matrix)));
solver->apply(rhs, u);
double max_err = 0.0;
const double h = 1.0 / (K + 1);
const auto* uv = u->get_const_values();
for (int i = 0; i < K; ++i) {
const double xi = (i + 1) * h;
max_err = std::max(max_err, std::abs(uv[i] - xi * xi * xi));
}
std::cout << "K = " << K
<< ", max |u_h - u_exact| = " << max_err << '\n';
}
CMakeLists.txt is unchanged from the CPU tutorial — Ginkgo’s
Ginkgo::ginkgo target pulls in whichever backends were built:
cmake_minimum_required(VERSION 3.16)
project(poisson_gpu CXX)
find_package(Ginkgo REQUIRED)
add_executable(poisson_gpu poisson_gpu.cpp)
target_link_libraries(poisson_gpu PRIVATE Ginkgo::ginkgo)
Build and run#
cmake -B build
cmake --build build
./build/poisson_gpu 4096
Expected output (the numerical answer is identical to the CPU version up to round-off — only the path differs):
K = 4096, max |u_h - u_exact| = ~1e-13
The exact final digit is hardware-, driver-, and library-dependent — but matching the CPU answer to within \(10^{-13}\) confirms the GPU path is giving you the same solution.
When is the GPU actually faster?#
Wrap the solver_factory->generate + solver->apply block in a timer to
measure end-to-end solve time:
#include <chrono>
// ...
auto tic = std::chrono::steady_clock::now();
auto solver = solver_factory->generate(gko::share(gko::clone(exec, matrix)));
solver->apply(rhs, u);
exec->synchronize(); // make sure all device work has finished
auto toc = std::chrono::steady_clock::now();
std::cout << "solve time: "
<< std::chrono::duration<double, std::milli>(toc - tic).count()
<< " ms\n";
The exec->synchronize() is important — device kernels launched by apply
are asynchronous, so without it you’d be timing the kernel-launch overhead
rather than the actual work. See
CudaExecutor for the synchronization model.
Run the same code with gko::OmpExecutor::create() substituted for the
device executor and you have a CPU-vs-GPU sweep. Qualitative pattern (your
exact numbers will depend heavily on hardware):
\(K\) |
OMP (e.g. 16-core CPU) |
CUDA (e.g. discrete GPU) |
Faster? |
|---|---|---|---|
\(10^2\) |
sub-millisecond |
sub-millisecond, dominated by launch overhead |
CPU |
\(10^4\) |
a few milliseconds |
a few milliseconds |
tied |
\(10^6\) |
tens of milliseconds |
a few milliseconds |
GPU |
\(10^7\)+ |
hundreds of ms |
low tens of ms |
GPU |
The crossover sits around \(K \sim 10^4\)–\(10^5\) for a 1D problem because that’s where the device’s memory bandwidth advantage starts to dominate over kernel-launch overhead. For 2D / 3D problems the crossover shifts to larger total non-zeros — different problem shape, same principle.
If you want to dig into where each iteration spends its time, attach a
Performance logger and inspect the
per-kernel breakdown.
Common pitfalls#
Reading device-resident vectors on the host. The compiler will let you dereference a device pointer; the OS will not. Always
gko::clone(host_exec, vec)first.Forgetting
exec->synchronize()in benchmarking code. Kernel launches return immediately. Timings that omitsynchronize()measure launch latency, not execution.Building the factory on the wrong executor. The factory’s
.on(...)call binds every component (CG iterates, the preconditioner) to a single executor. If you accidentally call.on(host_exec), the solve will run on the CPU even though your matrix is on the GPU — Ginkgo will copy data back across to make it work, silently and slowly.Re-cloning across solves. If you solve repeatedly with the same matrix (e.g. inside a Newton iteration), clone once outside the loop and reuse the solver. Cloning is not free.
Switching to a different vendor#
The whole point of the executor abstraction is that swapping vendors is a one-line change. To run on AMD or Intel hardware, change the executor construction line:
auto exec = gko::HipExecutor::create(0, host_exec); // AMD
auto exec = gko::DpcppExecutor::create(0, host_exec); // Intel GPU / SYCL
The build needs the corresponding Ginkgo backend
(-DGINKGO_BUILD_HIP=ON, -DGINKGO_BUILD_SYCL=ON); the source code itself
is unchanged. See Execution backends and
the per-executor pages.
Next steps#
Poisson on CPU — the matching CPU walk-through if you arrived here first.
Concepts: Executor model — the bigger picture of how executors fit together.
How-To: Move data between executors — clone, copy_from,
make_temporary_clonepatterns.How-To: Switch executor — for applications that want to pick a backend at runtime.
See also
CudaExecutor — stream, allocator, and device-id options.
HipExecutor — the AMD equivalent.
DpcppExecutor — the Intel / SYCL equivalent.