Build a mixed-precision kernel with accessors#

When a kernel is bandwidth-bound, moving fewer bytes is the most direct speedup available. Ginkgo’s accessor sub-system lets you store data at one precision and compute on it at another — the kernel reads and writes through a range, the storage cast happens transparently on every access. This recipe walks through wiring a mixed-precision AXPY ( y α x + y ) where x lives in float memory and y lives in double memory, with all arithmetic carried out in double.

The same template instantiates for any (storage, arithmetic) pair: swap two type aliases and the kernel body is unchanged. For more realistic kernels written against the same building blocks — GEMV, dot, TRSV with measured accuracy and bandwidth on A100 / V100 — see the companion accessor-BLAS repository.

Tip

A runnable bundle of the snippets on this page — main.cpp, kernel.cu, a CMakeLists.txt that detects CUDA, and a README.md with build / run instructions — is available as a single archive: mixed-precision-accessors.zip.

Step 1: Put the accessor headers on your include path#

The accessor sub-system lives under accessor/ in the Ginkgo source tree. It is header-only and depends on nothing else — neither libginkgo nor any installed include directory. To use it from a CMake project, add the source-tree accessor/ directory to your target’s include path:

add_executable(my_kernel my_kernel.cpp)
target_include_directories(my_kernel PRIVATE ${GINKGO_SOURCE_DIR})
target_compile_features(my_kernel PRIVATE cxx_std_17)

${GINKGO_SOURCE_DIR} is the root of the Ginkgo checkout — the directory containing accessor/, core/, include/, etc. Includes in your source then take the form <accessor/range.hpp>. C++17 is required because the accessor headers use std::void_t.

Step 2: Wrap your storage in ranges#

Build one gko::acc::range<reduced_row_major<...>> per input and output buffer. Vectors are expressed as N×1 ranges so the same code also handles strided access:

#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

#include <accessor/range.hpp>
#include <accessor/reduced_row_major.hpp>

constexpr std::size_t n = 1024;

std::vector<float>  x_storage(n, 1.0f);   // input  — float storage
std::vector<double> y_storage(n, 0.0);    // output — double storage

using x_acc = gko::acc::reduced_row_major</*Dim=*/2,
                                          /*ArithmeticType=*/double,
                                          /*StorageType=*/float>;
using y_acc = gko::acc::reduced_row_major</*Dim=*/2,
                                          /*ArithmeticType=*/double,
                                          /*StorageType=*/double>;

// Sizes and strides are passed as std::array, not braced initializer
// lists: template argument deduction on the range constructor doesn't
// see through brace-init shortcuts.
using sz = gko::acc::size_type;
std::array<sz, 2> size{n, 1};
std::array<sz, 1> stride{1};

auto x = gko::acc::range<x_acc>(size, x_storage.data(), stride);
auto y = gko::acc::range<y_acc>(size, y_storage.data(), stride);

The range constructor takes the logical {rows, cols} size, the typed storage pointer, and a stride of length dimensionality 1 (here just {1} for unit-stride vectors). The arithmetic type is what the kernel will see at every read; the storage type controls the actual memory layout and the cast applied on every load and store.

Step 3: Write a kernel templated on the range type#

The kernel never names a concrete storage type. It asks the range for its arithmetic_type and computes uniformly in that type:

template <typename XRange, typename YRange>
void mixed_axpy(typename YRange::accessor::arithmetic_type alpha,
                XRange x, YRange y)
{
    using ar_type = typename YRange::accessor::arithmetic_type;
    static_assert(std::is_same<
        ar_type, typename XRange::accessor::arithmetic_type>::value,
        "x and y must share an arithmetic type");

    for (std::int64_t i = 0; i < x.length(0); ++i) {
        const ar_type xi = x(i, 0);   // load:  float  → double
        const ar_type yi = y(i, 0);   // load:  double → double
        y(i, 0) = alpha * xi + yi;    // store: double → double
    }
}

Three things are doing the work here:

  • XRange and YRange are template parameters, so the kernel body has no knowledge of the underlying precisions.

  • Every x(i, 0) read is a float double upcast generated by the reduced accessor’s reference proxy — no manual conversion code.

  • The write to y(i, 0) goes through the same proxy in the opposite direction; for y_acc above that is double double (a no-op), but switching the storage to float would make it a downcast with no kernel change.

Step 4: Call the kernel#

Once the ranges and the kernel are in place, invocation is a single line:

mixed_axpy(2.5, x, y);   // y[i] += 2.5 * x[i], computed in double

Swapping precisions is a localised edit. To run the same kernel with both vectors in float storage:

using x_acc = gko::acc::reduced_row_major<2, double, float>;
using y_acc = gko::acc::reduced_row_major<2, double, float>;

To compute in float instead of double, change the arithmetic type on both aliases. The kernel body, the call site, and the surrounding algorithm stay identical — only the storage and arithmetic tags move.

Step 5: Same kernel on a CUDA device#

The kernel body needs no changes to run on the GPU — only the launch glue around it does. Mark it __global__, compute a thread index, and launch from a host function with the same ranges built over device pointers:

#include <cuda_runtime.h>

#include <accessor/cuda_helper.hpp>     // adds as_cuda_range / as_cuda_type
#include <accessor/range.hpp>
#include <accessor/reduced_row_major.hpp>

template <typename XRange, typename YRange>
__global__ void mixed_axpy_cuda(
    typename YRange::accessor::arithmetic_type alpha, XRange x, YRange y)
{
    using ar_type = typename YRange::accessor::arithmetic_type;
    const std::int64_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < x.length(0)) {
        const ar_type xi = x(i, 0);
        const ar_type yi = y(i, 0);
        y(i, 0) = alpha * xi + yi;
    }
}

void run_on_cuda(std::size_t n, float* d_x, double* d_y, double alpha)
{
    using sz = gko::acc::size_type;
    using x_acc = gko::acc::reduced_row_major<2, double, float>;
    using y_acc = gko::acc::reduced_row_major<2, double, double>;

    std::array<sz, 2> size{static_cast<sz>(n), 1};
    std::array<sz, 1> stride{1};

    auto x = gko::acc::range<x_acc>(size, d_x, stride);
    auto y = gko::acc::range<y_acc>(size, d_y, stride);

    constexpr int block_size = 256;
    const int grid_size = static_cast<int>((n + block_size - 1) / block_size);

    mixed_axpy_cuda<<<grid_size, block_size>>>(
        gko::acc::as_cuda_type(alpha),
        gko::acc::as_cuda_range(x),
        gko::acc::as_cuda_range(y));
}

Two CUDA-specific helpers appear in the launch:

  • as_cuda_range(r) rewrites the range’s storage type to the CUDA equivalent — a no-op for float / double, but the path that routes gko::half storage to __half and gko::bfloat16 to __nv_bfloat16. Using it unconditionally keeps the launch site precision-agnostic.

  • as_cuda_type(scalar) does the same for the alpha scalar.

d_x and d_y are plain device pointers — allocated with cudaMalloc, populated via cudaMemcpy or another kernel. The range constructor does not care that the pointer lives in device memory; it only matters that the eventual operator() call runs on the matching side.

On the build side, compile the translation unit as CUDA and link in the CUDA runtime:

enable_language(CUDA)

add_executable(my_kernel my_kernel.cu)
target_include_directories(my_kernel PRIVATE ${GINKGO_SOURCE_DIR})
target_compile_features(my_kernel PRIVATE cxx_std_17)
set_target_properties(my_kernel PROPERTIES CUDA_STANDARD 17)

# The accessor's constexpr index helpers are host-only by default;
# tell nvcc they are usable from device code too.
target_compile_options(my_kernel PRIVATE
    $<$<COMPILE_LANGUAGE:CUDA>:--expt-relaxed-constexpr>)

The HIP and SYCL paths follow the same shape — include <accessor/hip_helper.hpp> and use as_hip_range / as_hip_type, or include <accessor/sycl_helper.hpp> and submit the kernel to a SYCL queue. The kernel body is unchanged across all three. acc_gemv and acc_dot in accessor-BLAS are the canonical CUDA references for this pattern with reductions and shared-memory scratch.

Step 6: Use scaled storage when the dynamic range is too wide#

reduced_row_major works well as long as the values you store fit in the narrow representation directly — float-stored double data, for example, is fine because float covers most of double’s useful range. Push the storage to int8_t (or int16_t, half-precision, …) and the picture changes: the storage type covers a tiny fraction of the original range, and naive narrowing saturates or underflows.

scaled_reduced_row_major recovers the range by carrying a second buffer of scaling factors. Every read returns scale * stored_value cast to the arithmetic type; every write divides the incoming value by scale before casting it down to storage. A scalar mask picks which dimensions the scalar varies over — 0b00 for a single global scale, 0b10 for one scale per row of a 2D buffer, and so on (the LSB tracks the innermost index).

Here is the same AXPY with x stored as int8_t plus a single global scaling factor:

#include <accessor/range.hpp>
#include <accessor/scaled_reduced_row_major.hpp>

constexpr std::uint64_t single_scale = 0b00;   // one scalar for the whole range

using x_acc = gko::acc::scaled_reduced_row_major</*Dim=*/2,
                                                 /*ArithmeticType=*/double,
                                                 /*StorageType=*/std::int8_t,
                                                 single_scale>;

std::vector<std::int8_t> x_storage(n);
double                   x_scale = 1.0;

// Initialise the scale so that |x|_max maps to int8_t's range [-127, 127].
// Pick the scale once, before any writes — the kernel divides by it on
// store and multiplies by it on load.
x_scale = x_max_abs / 127.0;

auto x = gko::acc::range<x_acc>(size, x_storage.data(),
                                /*storage_stride=*/stride,
                                &x_scale);

// y_acc unchanged — plain double-stored output.
auto y = gko::acc::range<y_acc>(size, y_storage.data(), stride);

mixed_axpy(2.5, x, y);   // reads route through scale, writes divide it out

The kernel body in Step 3 is unchanged. x(i, 0) now returns static_cast<double>(x_storage[i]) * x_scale, and any write to x would silently apply x_storage[i] = static_cast<int8_t>(value / x_scale). For a per-row variant, switch the mask to 0b10, allocate n_rows scalars, and pass that pointer as the scalar argument; the accessor will pick the right scale for each row automatically.

The store path is a C++ floating-to-integer conversion, which the standard defines as truncation toward zero — not round-to-nearest. The worst-case round-trip error per element is therefore one full x_scale, not half of one. If half-LSB accuracy matters, wrap the write with an explicit std::round or pick a scale that accommodates the looser bound.

Going further#

See also

  • Accessors — what accessors are and why the storage / arithmetic split exists.

  • Mixed precision — how the solver layer surfaces the same trick.

  • accessor-BLAS — full GEMV / dot / TRSV kernels with accuracy and bandwidth measurements.