Add a new preconditioner#
A preconditioner in Ginkgo is a LinOp with two extra entry points:
generate(system_matrix) which builds the operator from a source matrix,
and apply(b, x) which applies the resulting operator. There is no
workspace machinery — preconditioners do their work in a single
apply and have no iteration loop.
This page covers two patterns: a local preconditioner modelled on
preconditioner::Jacobi (include/ginkgo/core/preconditioner/jacobi.hpp,
core/preconditioner/jacobi.cpp), and a distributed preconditioner
modelled on experimental::distributed::preconditioner::Schwarz
(include/ginkgo/core/distributed/preconditioner/schwarz.hpp,
core/distributed/preconditioner/schwarz.cpp).
What you commit to implement#
File |
Local (Jacobi-style) |
Distributed (Schwarz-style) |
|---|---|---|
Public header |
|
|
Implementation |
|
|
Kernel header |
|
Usually reuses local kernels — no MPI in kernels |
Backend kernels |
|
Same |
|
Add |
Add under the |
|
Append to |
Append to |
|
Add to |
Same, but guarded |
Parse specialisation |
A line in |
A new file |
The local preconditioner skeleton#
template <typename ValueType = default_precision, typename IndexType = int32>
class MyPrecond
: public EnableLinOp<MyPrecond<ValueType, IndexType>>,
public Transposable {
friend class EnableLinOp<MyPrecond>;
friend class EnablePolymorphicObject<MyPrecond, LinOp>;
GKO_ASSERT_SUPPORTED_VALUE_AND_INDEX_TYPE;
public:
using value_type = ValueType;
using index_type = IndexType;
using transposed_type = MyPrecond<ValueType, IndexType>;
std::unique_ptr<LinOp> transpose() const override;
std::unique_ptr<LinOp> conj_transpose() const override;
GKO_CREATE_FACTORY_PARAMETERS(parameters, Factory)
{
// Whatever the preconditioner needs to know at generate time:
bool GKO_FACTORY_PARAMETER_SCALAR(skip_sorting, false);
uint32 GKO_FACTORY_PARAMETER_SCALAR(max_block_size, 32u);
// Sub-factories use the deferred variant:
std::shared_ptr<const LinOpFactory>
GKO_DEFERRED_FACTORY_PARAMETER(inner_factory);
};
GKO_ENABLE_LIN_OP_FACTORY(MyPrecond, parameters, Factory);
GKO_ENABLE_BUILD_METHOD(Factory);
static parameters_type parse(
const config::pnode& config,
const config::registry& context,
const config::type_descriptor& td_for_child =
config::make_type_descriptor<ValueType, IndexType>());
protected:
explicit MyPrecond(std::shared_ptr<const Executor> exec)
: EnableLinOp<MyPrecond>(std::move(exec)) {}
explicit MyPrecond(const Factory* factory,
std::shared_ptr<const LinOp> system_matrix)
: EnableLinOp<MyPrecond>(factory->get_executor(),
gko::transpose(system_matrix->get_size())),
parameters_{factory->get_parameters()}
{
this->generate(system_matrix.get(), parameters_.skip_sorting);
}
void apply_impl(const LinOp* b, LinOp* x) const override;
void apply_impl(const LinOp* alpha, const LinOp* b,
const LinOp* beta, LinOp* x) const override;
private:
void generate(const LinOp* system_matrix, bool skip_sorting);
// Whatever device-side state your generate() builds — e.g. a Csr
// of factor entries, an array of block pointers, ...:
array<value_type> data_;
};
Compared to a solver:
No
EnablePreconditionedIterativeSolverbase — preconditioners don’t iterate. The only base isEnableLinOp<Self>.No
workspace_traitsspecialisation — there’s no iteration scratch to cache.generate(...)runs from the second constructor (the one the factory uses). After it returns, theprivatestate holds everythingapplyneeds.
generate#
generate is preconditioner-specific. It typically:
Allocates the device-side storage for the preconditioner data (e.g. block diagonal blocks, ILU factor
LandUmatrices).Optionally sorts the system matrix (
skip_sortinglets callers skip this when they already know it’s sorted).Runs one or more kernels to populate that storage.
Use gko::clone(this->get_executor(), as<...>(system_matrix)) if you
need a typed view of the input on the right executor.
apply_impl#
apply_impl for a local preconditioner uses the non-distributed
precision dispatch and runs a kernel that consumes the precomputed
state:
template <typename ValueType, typename IndexType>
void MyPrecond<ValueType, IndexType>::apply_impl(const LinOp* b,
LinOp* x) const
{
precision_dispatch_real_complex<ValueType>(
[this](auto dense_b, auto dense_x) {
this->get_executor()->run(my_precond::make_apply(
/* preconditioner state */ data_,
dense_b->get_const_device_view(),
dense_x->get_device_view()));
},
b, x);
}
template <typename ValueType, typename IndexType>
void MyPrecond<ValueType, IndexType>::apply_impl(const LinOp* alpha,
const LinOp* b,
const LinOp* beta,
LinOp* x) const
{
precision_dispatch_real_complex<ValueType>(
[this](auto dense_alpha, auto dense_b,
auto dense_beta, auto dense_x) {
this->get_executor()->run(my_precond::make_apply(
dense_alpha->get_const_device_view(),
data_,
dense_b->get_const_device_view(),
dense_beta->get_const_device_view(),
dense_x->get_device_view()));
},
alpha, b, beta, x);
}
The distributed preconditioner skeleton#
template <typename ValueType = default_precision,
typename LocalIndexType = int32,
typename GlobalIndexType = int64>
class MyPrecond
: public EnableLinOp<MyPrecond<ValueType, LocalIndexType, GlobalIndexType>> {
friend class EnableLinOp<MyPrecond>;
friend class EnablePolymorphicObject<MyPrecond, LinOp>;
GKO_ASSERT_SUPPORTED_VALUE_AND_DIST_INDEX_TYPE;
public:
using value_type = ValueType;
using local_index_type = LocalIndexType;
using global_index_type = GlobalIndexType;
GKO_CREATE_FACTORY_PARAMETERS(parameters, Factory)
{
std::shared_ptr<const LinOpFactory>
GKO_DEFERRED_FACTORY_PARAMETER(local_solver);
};
GKO_ENABLE_LIN_OP_FACTORY(MyPrecond, parameters, Factory);
GKO_ENABLE_BUILD_METHOD(Factory);
static parameters_type parse(
const config::pnode& config,
const config::registry& context,
const config::type_descriptor& td_for_child =
config::make_type_descriptor<ValueType, LocalIndexType,
GlobalIndexType>());
protected:
explicit MyPrecond(std::shared_ptr<const Executor> exec)
: EnableLinOp<MyPrecond>(std::move(exec)) {}
explicit MyPrecond(const Factory* factory,
std::shared_ptr<const LinOp> system_matrix)
: EnableLinOp<MyPrecond>(factory->get_executor(),
gko::transpose(system_matrix->get_size())),
parameters_{factory->get_parameters()}
{
this->generate(std::move(system_matrix));
}
void apply_impl(const LinOp* b, LinOp* x) const override;
template <typename VectorType>
void apply_dense_impl(const VectorType* b, VectorType* x) const;
void apply_impl(const LinOp* alpha, const LinOp* b,
const LinOp* beta, LinOp* x) const override;
private:
void generate(std::shared_ptr<const LinOp> system_matrix);
std::shared_ptr<const LinOp> local_solver_;
detail::VectorCache<ValueType> cache_; // for advanced apply
};
Two differences from the local case:
The class is templated on three types (
V,L,G) and lives undergko::experimental::distributed::preconditioner. The header is wrapped in#if GINKGO_BUILD_MPIand the source is added tocore/CMakeLists.txtunder theif(GINKGO_BUILD_MPI)block.The static-assert is
GKO_ASSERT_SUPPORTED_VALUE_AND_DIST_INDEX_TYPE, which constrainsLocalIndexTypeandGlobalIndexTypeto the distributed-supported combinations.
The apply_impl for a purely local Schwarz-style preconditioner applies
the inner solver to the rank-local Dense slice and returns. The
Schwarz preconditioner does no MPI communication of its own:
template <typename ValueType, typename LocalIndexType, typename GlobalIndexType>
void MyPrecond<ValueType, LocalIndexType, GlobalIndexType>::apply_impl(
const LinOp* b, LinOp* x) const
{
precision_dispatch_real_complex_distributed<ValueType>(
[this](auto dense_b, auto dense_x) {
this->apply_dense_impl(dense_b, dense_x);
},
b, x);
}
template <typename ValueType, typename LocalIndexType, typename GlobalIndexType>
template <typename VectorType>
void MyPrecond<ValueType, LocalIndexType, GlobalIndexType>::apply_dense_impl(
const VectorType* dense_b, VectorType* dense_x) const
{
if (this->local_solver_) {
this->local_solver_->apply(gko::detail::get_local(dense_b),
gko::detail::get_local(dense_x));
}
}
gko::detail::get_local(v) is the helper from
core/distributed/helpers.hpp. It returns the rank-local
matrix::Dense<V>* for both distributed::Vector and ordinary
matrix::Dense, so the same iteration body covers serial and MPI use.
Where MPI can live#
A distributed preconditioner can do its own MPI communication, but not everywhere:
Allowed: core layer —
core/distributed/preconditioner/my_precond.cppand any host-side helpers it calls.Forbidden: kernel modules —
reference/,cuda/,hip/,omp/,dpcpp/,common/unified/,common/cuda_hip/must not includempi.hor touch the communicator.
The split is the same circular-dependency rule that applies to every
kernel module: backend kernels depend on core/, not the other way
around. The no-circular-deps CI job catches violations.
When communication is needed, structure apply_dense_impl as an
interleaving of kernel launches and MPI calls so MPI latency hides
behind on-device work:
Launch the kernel work that has no remote dependency.
Issue the non-blocking MPI call (
i_all_to_all_v,Isend/Irecv, …) from the host — returns immediately, kernel keeps running.Wait on the MPI request, or use a stream event (see Stream-MPI ordering for GPU executors).
Launch the kernel that consumes the received data.
The distributed Matrix::apply is the canonical example: diagonal
matrix kernel runs concurrently with the halo i_all_to_all_v, then
the off-diagonal kernel fires once the receive completes (see
Communication
primitives).
Replicate that structure rather than putting an MPI_Allreduce
between two synchronous kernel launches.
Config / JSON registration#
Add an entry to the LinOpFactoryType enum in
core/config/config_helper.hpp and a map entry to
generate_config_map() in core/config/registry.cpp (MPI-gate the
latter for a distributed preconditioner — see the Schwarz line as the
precedent). Then provide the parse specialisation:
Local, simple dispatch — one line in
core/config/preconditioner_config.cppnext to the Jacobi entry, usingGKO_PARSE_VALUE_AND_INDEX_TYPE(orGKO_PARSE_VALUE_TYPEfor a value-only preconditioner).Local, custom dispatch — its own file, mirroring
preconditioner_ilu_config.cpporpreconditioner_isai_config.cpp.Distributed — always its own file under the
if(GINKGO_BUILD_MPI)block ofcore/CMakeLists.txt. Copycore/config/schwarz_config.cpp; the hand-writtenparse<LinOpFactoryType::Schwarz>handles the(LocalIndexType, GlobalIndexType)dispatch and skips invalid combinations like(int64, int32).
Instantiation#
End the .cpp with the explicit-instantiation block. For a local
preconditioner:
#define GKO_DECLARE_MY_PRECOND(ValueType, IndexType) \
class MyPrecond<ValueType, IndexType>
GKO_INSTANTIATE_FOR_EACH_VALUE_AND_INDEX_TYPE(GKO_DECLARE_MY_PRECOND);
For a distributed preconditioner:
#define GKO_DECLARE_MY_PRECOND(V, L, G) class MyPrecond<V, L, G>
GKO_INSTANTIATE_FOR_EACH_VALUE_AND_LOCAL_GLOBAL_INDEX_TYPE(GKO_DECLARE_MY_PRECOND);
Reference-parity test#
The full test plan is covered in Write tests. For a new local preconditioner you write three files:
core/test/preconditioner/my_precond.cpp— pure-C++ tests on the factory:parameters_typedefaults,with_*setters,build()accepting a system matrix,transpose()returning a valid object. These don’t need a backend or apply. Registered withginkgo_create_test(my_precond)incore/test/preconditioner/CMakeLists.txt.reference/test/preconditioner/my_precond_kernels.cpp— Reference-only correctness tests on small hand-checkable inputs. Registered withginkgo_create_test(my_precond_kernels).test/preconditioner/my_precond_kernels.cpp— the cross-backend parity test. Same source file compiled once per enabled backend (omp,cuda,hip,dpcpp) byginkgo_create_common_test, usingCommonTestFixture(fromtest/utils/common_fixture.hpp) which provides bothref(Reference) andexec(the target backend). Each test sets up the same input on both, applies the preconditioner, and asserts agreement withGKO_ASSERT_MTX_NEAR(...). Registered withginkgo_create_common_test(my_precond_kernels)intest/preconditioner/CMakeLists.txt.
jacobi*.cpp in those three trees is the template to copy.
For a new distributed preconditioner, the cross-backend parity
test lives at test/mpi/preconditioner/my_precond.cpp and is
registered with ginkgo_create_common_and_reference_test(my_precond MPI_SIZE 3 LABELS distributed). That helper stamps out one test per
enabled backend plus a Reference variant, all run under MPI on the
requested rank count. test/mpi/preconditioner/schwarz.cpp is the
template.
See also
Add a new solver — the iterative pattern, with workspace machinery.
Distributed solvers and preconditioners — what the distributed preconditioner composes with.
Configure via JSON — how config edits surface in user-facing JSON.