Use a batched solver#

Ginkgo’s batched solvers solve many small independent linear systems in a single fused kernel, with no inter-batch communication. The price is the uniform-batch constraint: every system must have the same dimensions and the same sparsity pattern. When that holds, batched solvers significantly outperform a loop over single-system solves.

The recipe#

#include <ginkgo/ginkgo.hpp>

namespace b = gko::batch;

const gko::size_type N = 100;     // number of batch items
const int n = 256, nnz = 1024;    // rows and non-zeros per item

auto exec = gko::CudaExecutor::create(0, gko::OmpExecutor::create());

// Build the batched matrix from three arrays:
//    - values:   N * nnz entries, item k at offset k * nnz
//    - col_idxs: nnz entries, shared by every item
//    - row_ptrs: n + 1 entries, shared by every item
gko::array<double> values  {exec, N * nnz};
gko::array<int>    col_idxs{exec, nnz};
gko::array<int>    row_ptrs{exec, n + 1};
// ... fill from your application data ...

auto A = gko::share(b::matrix::Csr<double, int>::create(
    exec,
    gko::batch_dim<2>{N, gko::dim<2>{n, n}},
    std::move(values), std::move(col_idxs), std::move(row_ptrs)));

// Build the right-hand side and solution multi-vectors.
gko::batch_dim<2> vec_dim{N, gko::dim<2>{n, 1}};
auto rhs = b::MultiVector<double>::create(exec, vec_dim);
auto sol = b::MultiVector<double>::create(exec, vec_dim);
// ... fill rhs (per-item values laid out the same way as A's values) ...

// Configure and apply a batched CG.
auto solver = b::solver::Cg<double>::build()
    .with_max_iterations(100)
    .with_tolerance(1e-8)
    .with_tolerance_type(b::stop::tolerance_type::absolute)
    .on(exec)
    ->generate(A);

solver->apply(rhs, sol);

After apply, sol holds all N solutions; each item lives at offset k * n * 1 in the multi-vector’s value buffer.

Pick the solver#

Class

When to use

batch::solver::Cg<V>

Each item is SPD.

batch::solver::Bicgstab<V>

Each item is general non-symmetric.

Both accept a preconditioner via with_preconditioner — currently batch::preconditioner::Jacobi is the only batched preconditioner.

Read per-item convergence#

batch::log::BatchConvergence captures the final iteration count and residual norm per item:

auto conv = b::log::BatchConvergence<double>::create();
solver->add_logger(conv);
solver->apply(rhs, sol);

// Copy to host for inspection. get_num_iterations() returns gko::array<int>
// on the solver's executor.
auto iters_host = gko::array<int>{exec->get_master(),
                                  conv->get_num_iterations()};
auto resn_host  = gko::array<double>{exec->get_master(),
                                     conv->get_residual_norm()};

for (gko::size_type k = 0; k < N; ++k) {
    std::cout << "item " << k
              << " iters=" << iters_host.get_const_data()[k]
              << " ||r||=" << resn_host.get_const_data()[k] << '\n';
}

The arrays are sized on first completion and reused on subsequent applies of the same batch size.

Wrap application-owned memory#

For zero-copy from a buffer your application already owns, use create_const:

auto A = b::matrix::Csr<double, int>::create_const(
    exec, mat_size,
    gko::make_const_array_view(exec, N * nnz, values_ptr),
    gko::make_const_array_view(exec, nnz,     col_idxs_ptr),
    gko::make_const_array_view(exec, n + 1,   row_ptrs_ptr));

The application is responsible for keeping the buffers alive while the batched matrix is in use.

Common pitfalls#

  • Uniform-batch constraint. Items with different sparsity patterns or different nnz per item are not supported. If your patterns differ slightly, pad each item to the union of patterns.

  • Dense input is not gko::matrix::Dense. Batched RHS / solution vectors are batch::MultiVector<V>, a distinct type. Mixing the two does not compile.

  • No per-iteration logging. The batched path runs entirely inside the kernel; per-iteration host callbacks would defeat the fused design. Only completion is logged.

  • CPU backends. The fused-kernel design assumes a GPU target. Batched solvers compile for OpenMP and Reference but the speedup over single-system solves is much smaller there.

See also