Build a mixed-precision kernel with accessors#
When a kernel is bandwidth-bound, moving fewer bytes is the most
direct speedup available. Ginkgo’s accessor sub-system lets you store
data at one precision and compute on it at another — the kernel reads
and writes through a range, the storage cast happens transparently on
every access. This recipe walks through wiring a mixed-precision
AXPY ( y ← α x + y ) where x lives in float memory and y
lives in double memory, with all arithmetic carried out in double.
The same template instantiates for any (storage, arithmetic) pair:
swap two type aliases and the kernel body is unchanged. For more
realistic kernels written against the same building blocks — GEMV, dot,
TRSV with measured accuracy and bandwidth on A100 / V100 — see the
companion accessor-BLAS repository.
Tip
A runnable bundle of the snippets on this page — main.cpp,
kernel.cu, a CMakeLists.txt that detects CUDA, and a README.md
with build / run instructions — is available as a single archive:
mixed-precision-accessors.zip.
Step 1: Put the accessor headers on your include path#
The accessor sub-system lives under accessor/ in the Ginkgo source
tree. It is header-only and depends on nothing else — neither
libginkgo nor any installed include directory. To use it from a
CMake project, add the source-tree accessor/ directory to your
target’s include path:
add_executable(my_kernel my_kernel.cpp)
target_include_directories(my_kernel PRIVATE ${GINKGO_SOURCE_DIR})
target_compile_features(my_kernel PRIVATE cxx_std_17)
${GINKGO_SOURCE_DIR} is the root of the Ginkgo checkout — the
directory containing accessor/, core/, include/, etc. Includes
in your source then take the form <accessor/range.hpp>. C++17 is
required because the accessor headers use std::void_t.
Step 2: Wrap your storage in ranges#
Build one gko::acc::range<reduced_row_major<...>> per input and
output buffer. Vectors are expressed as N×1 ranges so the same code
also handles strided access:
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>
#include <accessor/range.hpp>
#include <accessor/reduced_row_major.hpp>
constexpr std::size_t n = 1024;
std::vector<float> x_storage(n, 1.0f); // input — float storage
std::vector<double> y_storage(n, 0.0); // output — double storage
using x_acc = gko::acc::reduced_row_major</*Dim=*/2,
/*ArithmeticType=*/double,
/*StorageType=*/float>;
using y_acc = gko::acc::reduced_row_major</*Dim=*/2,
/*ArithmeticType=*/double,
/*StorageType=*/double>;
// Sizes and strides are passed as std::array, not braced initializer
// lists: template argument deduction on the range constructor doesn't
// see through brace-init shortcuts.
using sz = gko::acc::size_type;
std::array<sz, 2> size{n, 1};
std::array<sz, 1> stride{1};
auto x = gko::acc::range<x_acc>(size, x_storage.data(), stride);
auto y = gko::acc::range<y_acc>(size, y_storage.data(), stride);
The range constructor takes the logical {rows, cols} size, the
typed storage pointer, and a stride of length dimensionality − 1
(here just {1} for unit-stride vectors). The arithmetic type is what
the kernel will see at every read; the storage type controls the
actual memory layout and the cast applied on every load and store.
Step 3: Write a kernel templated on the range type#
The kernel never names a concrete storage type. It asks the range for
its arithmetic_type and computes uniformly in that type:
template <typename XRange, typename YRange>
void mixed_axpy(typename YRange::accessor::arithmetic_type alpha,
XRange x, YRange y)
{
using ar_type = typename YRange::accessor::arithmetic_type;
static_assert(std::is_same<
ar_type, typename XRange::accessor::arithmetic_type>::value,
"x and y must share an arithmetic type");
for (std::int64_t i = 0; i < x.length(0); ++i) {
const ar_type xi = x(i, 0); // load: float → double
const ar_type yi = y(i, 0); // load: double → double
y(i, 0) = alpha * xi + yi; // store: double → double
}
}
Three things are doing the work here:
XRangeandYRangeare template parameters, so the kernel body has no knowledge of the underlying precisions.Every
x(i, 0)read is afloat → doubleupcast generated by the reduced accessor’s reference proxy — no manual conversion code.The write to
y(i, 0)goes through the same proxy in the opposite direction; fory_accabove that isdouble → double(a no-op), but switching the storage tofloatwould make it a downcast with no kernel change.
Step 4: Call the kernel#
Once the ranges and the kernel are in place, invocation is a single line:
mixed_axpy(2.5, x, y); // y[i] += 2.5 * x[i], computed in double
Swapping precisions is a localised edit. To run the same kernel with
both vectors in float storage:
using x_acc = gko::acc::reduced_row_major<2, double, float>;
using y_acc = gko::acc::reduced_row_major<2, double, float>;
To compute in float instead of double, change the arithmetic type
on both aliases. The kernel body, the call site, and the surrounding
algorithm stay identical — only the storage and arithmetic tags move.
Step 5: Same kernel on a CUDA device#
The kernel body needs no changes to run on the GPU — only the launch
glue around it does. Mark it __global__, compute a thread index, and
launch from a host function with the same ranges built over device
pointers:
#include <cuda_runtime.h>
#include <accessor/cuda_helper.hpp> // adds as_cuda_range / as_cuda_type
#include <accessor/range.hpp>
#include <accessor/reduced_row_major.hpp>
template <typename XRange, typename YRange>
__global__ void mixed_axpy_cuda(
typename YRange::accessor::arithmetic_type alpha, XRange x, YRange y)
{
using ar_type = typename YRange::accessor::arithmetic_type;
const std::int64_t i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < x.length(0)) {
const ar_type xi = x(i, 0);
const ar_type yi = y(i, 0);
y(i, 0) = alpha * xi + yi;
}
}
void run_on_cuda(std::size_t n, float* d_x, double* d_y, double alpha)
{
using sz = gko::acc::size_type;
using x_acc = gko::acc::reduced_row_major<2, double, float>;
using y_acc = gko::acc::reduced_row_major<2, double, double>;
std::array<sz, 2> size{static_cast<sz>(n), 1};
std::array<sz, 1> stride{1};
auto x = gko::acc::range<x_acc>(size, d_x, stride);
auto y = gko::acc::range<y_acc>(size, d_y, stride);
constexpr int block_size = 256;
const int grid_size = static_cast<int>((n + block_size - 1) / block_size);
mixed_axpy_cuda<<<grid_size, block_size>>>(
gko::acc::as_cuda_type(alpha),
gko::acc::as_cuda_range(x),
gko::acc::as_cuda_range(y));
}
Two CUDA-specific helpers appear in the launch:
as_cuda_range(r)rewrites the range’s storage type to the CUDA equivalent — a no-op forfloat/double, but the path that routesgko::halfstorage to__halfandgko::bfloat16to__nv_bfloat16. Using it unconditionally keeps the launch site precision-agnostic.as_cuda_type(scalar)does the same for thealphascalar.
d_x and d_y are plain device pointers — allocated with cudaMalloc,
populated via cudaMemcpy or another kernel. The range constructor
does not care that the pointer lives in device memory; it only matters
that the eventual operator() call runs on the matching side.
On the build side, compile the translation unit as CUDA and link in the CUDA runtime:
enable_language(CUDA)
add_executable(my_kernel my_kernel.cu)
target_include_directories(my_kernel PRIVATE ${GINKGO_SOURCE_DIR})
target_compile_features(my_kernel PRIVATE cxx_std_17)
set_target_properties(my_kernel PROPERTIES CUDA_STANDARD 17)
# The accessor's constexpr index helpers are host-only by default;
# tell nvcc they are usable from device code too.
target_compile_options(my_kernel PRIVATE
$<$<COMPILE_LANGUAGE:CUDA>:--expt-relaxed-constexpr>)
The HIP and SYCL paths follow the same shape — include
<accessor/hip_helper.hpp> and use as_hip_range / as_hip_type, or
include <accessor/sycl_helper.hpp> and submit the kernel to a SYCL
queue. The kernel body is unchanged across all three. acc_gemv and
acc_dot in accessor-BLAS are the canonical CUDA
references for this pattern with reductions and shared-memory scratch.
Step 6: Use scaled storage when the dynamic range is too wide#
reduced_row_major works well as long as the values you store fit in
the narrow representation directly — float-stored double data, for
example, is fine because float covers most of double’s useful
range. Push the storage to int8_t (or int16_t, half-precision, …)
and the picture changes: the storage type covers a tiny fraction of
the original range, and naive narrowing saturates or underflows.
scaled_reduced_row_major recovers the range by carrying a second
buffer of scaling factors. Every read returns scale * stored_value
cast to the arithmetic type; every write divides the incoming value by
scale before casting it down to storage. A scalar mask picks which
dimensions the scalar varies over — 0b00 for a single global scale,
0b10 for one scale per row of a 2D buffer, and so on (the LSB
tracks the innermost index).
Here is the same AXPY with x stored as int8_t plus a single global
scaling factor:
#include <accessor/range.hpp>
#include <accessor/scaled_reduced_row_major.hpp>
constexpr std::uint64_t single_scale = 0b00; // one scalar for the whole range
using x_acc = gko::acc::scaled_reduced_row_major</*Dim=*/2,
/*ArithmeticType=*/double,
/*StorageType=*/std::int8_t,
single_scale>;
std::vector<std::int8_t> x_storage(n);
double x_scale = 1.0;
// Initialise the scale so that |x|_max maps to int8_t's range [-127, 127].
// Pick the scale once, before any writes — the kernel divides by it on
// store and multiplies by it on load.
x_scale = x_max_abs / 127.0;
auto x = gko::acc::range<x_acc>(size, x_storage.data(),
/*storage_stride=*/stride,
&x_scale);
// y_acc unchanged — plain double-stored output.
auto y = gko::acc::range<y_acc>(size, y_storage.data(), stride);
mixed_axpy(2.5, x, y); // reads route through scale, writes divide it out
The kernel body in Step 3 is unchanged. x(i, 0) now returns
static_cast<double>(x_storage[i]) * x_scale, and any write to x
would silently apply x_storage[i] = static_cast<int8_t>(value / x_scale).
For a per-row variant, switch the mask to 0b10, allocate n_rows
scalars, and pass that pointer as the scalar argument; the accessor
will pick the right scale for each row automatically.
The store path is a C++ floating-to-integer conversion, which the
standard defines as truncation toward zero — not round-to-nearest.
The worst-case round-trip error per element is therefore one full
x_scale, not half of one. If half-LSB accuracy matters, wrap the
write with an explicit std::round or pick a scale that
accommodates the looser bound.
Going further#
Using mixed precision through Ginkgo solvers. If you want to consume mixed precision rather than write the kernel — combining a matrix, solver, and preconditioner at different precisions — see Mixed and multi-precision solvers and preconditioners, with the design rationale in Mixed precision.
See also
Accessors — what accessors are and why the storage / arithmetic split exists.
Mixed precision — how the solver layer surfaces the same trick.
accessor-BLAS — full GEMV / dot / TRSV kernels with accuracy and bandwidth measurements.