RowGatherer and RowScatterer#
Distributed sparse-matrix kernels need two symmetric pieces of MPI bookkeeping. To compute \(y = A x\) where \(A\) is row-distributed, every rank must pull in the entries of \(x\) that its off-diagonal matrix references but does not own — a halo gather. To compute \(y = A^T x\) (or any operation that builds \(y\) from contributions on multiple ranks), every rank must push out locally-computed contributions to the ranks that own the corresponding rows of \(y\), where they are accumulated — a halo scatter.
RowGatherer and RowScatterer are the two helper objects that encapsulate these patterns. Both
sit on top of a collective communicator and reuse a precomputed
index pattern across many apply calls so that the per-call cost is just the all-to-all itself.
They are inverses of each other: the scatter pattern is the gather pattern with send and receive
roles swapped, which is exactly what CollectiveCommunicator::create_inverse() produces.
Most user code does not interact with these helpers directly — the distributed Matrix constructs
and uses them internally. They are exposed publicly for users implementing custom distributed
LinOp operations that need to drive their own asynchronous halo exchanges.
RowGatherer#
gko::experimental::distributed::RowGatherer<L> is the helper that drives the halo exchange in
distributed Matrix::apply. It packs the rows of a distributed Vector that other ranks need
into a contiguous buffer, fires the all-to-all through a
CollectiveCommunicator, and writes the result into a destination
vector.
What it does#
The mathematical operation is the row-selection map \(x_\text{halo} = R_\text{halo}\, x\), where
\(R_\text{halo}\) is the indicator that picks out the global rows of \(x\) that this rank’s
off-diagonal matrix needs. In distributed terms each rank owns a contiguous block of \(x\), so satisfying
\(x_\text{halo}\) requires other ranks to send their owned values to this rank — i.e. an
all-to-all. RowGatherer is the object that owns the index list (“which local rows of my \(x\)
do I need to send out, and to whom”) plus the send-side buffer.
Construction#
The recommended constructor takes an executor, a collective communicator, and an IndexMap:
namespace dist = gko::experimental::distributed;
namespace mpi = gko::experimental::mpi;
// Build a collective communicator from the matrix's index map.
auto coll_comm = std::make_shared<mpi::NeighborhoodCommunicator>(comm, imap);
// Construct the row gatherer.
auto rg = dist::RowGatherer<gko::int32>::create(exec, coll_comm, imap);
Internally the constructor calls coll_comm->create_inverse() and runs a single
i_all_to_all_v to derive the local send indices from the remote recv indices encoded in
the index map. The send-index array is cached on the executor and reused for every subsequent
exchange.
A simpler RowGatherer::create(exec, comm) constructor is available for cases where you only have
a base communicator. It builds an empty default CollectiveCommunicator (currently
DenseCommunicator) and is mainly useful as a default-constructed handle to be assigned into.
The asynchronous apply#
auto b = dist::Vector<double>::create(exec, comm, ...);
auto x = gko::matrix::Dense<double>::create(exec, ...);
auto req = rg->apply_async(b, x);
// ... do unrelated work that does not touch x ...
req.wait();
// x now holds the gathered rows.
apply_async returns an mpi::request. The receive buffer (x) must not be read or written
between the call and req.wait(), and the sender’s local data inside b must not be modified
either. Once wait() returns, x holds exactly get_size()[0] rows — the rows of b that the
collective communicator’s pattern said this rank needs.
Warning
The two-argument apply_async uses the gatherer’s internal send buffer, so only one mpi::request
per RowGatherer instance can be active at a time. Calling it again before wait()-ing on the
previous request is undefined behaviour. To overlap multiple exchanges on the same gatherer, use
the three-argument overload and pass a separate GenericDenseCache workspace per in-flight call.
The output executor must be compatible with the MPI implementation. If the MPI is not
GPU-aware and the executor is a CudaExecutor / HipExecutor / DpcppExecutor, the call throws
unless the destination’s executor is the host executor. Internally, RowGatherer uses
mpi::requires_host_buffer(exec, comm) to decide whether to stage through host memory.
Inspection accessors#
RowGatherer exposes a small read-only surface for inspecting the cached pattern:
Method |
Returns |
|---|---|
|
|
|
The shared |
|
Pointer to the local row indices being sent |
|
Length of the send-index array |
The send-index array is stored on the same executor the row gatherer lives on, so reading it from
the host requires gko::clone(host, ...) first.
How distributed Matrix::apply uses it#
The matrix holds a RowGatherer per IndexMap. The 6-step apply trace
(Communication primitives) calls into it as follows:
rg->apply_prepare(b)packs the values that need to be sent into the workspace buffer.The diagonal-matrix kernel runs in parallel.
rg->apply_finalize(b, x_halo, prepare_event)issues the all-to-all and returns thempi::request.req.wait()—x_halois now populated.The off-diagonal-matrix kernel applies to
x_halo.The diagonal-matrix result and the off-diagonal-matrix result are summed into
y_local.
These two protected entry points (apply_prepare / apply_finalize) are exposed via the
detail friend interface for testing and for advanced users who want fine-grained control over
when the exchange is launched relative to other work.
RowScatterer#
Placeholder for future.
See also
Communication primitives — the SpMV pipeline that consumes the row gatherer.
Collective communicators — the layer beneath the gatherer.
Partitions and index maps — the bookkeeping that defines which rows get gathered.
API reference:
gko::experimental::distributed::RowGatherer