Communication primitives#

A distributed Matrix×Vector apply is the unit of communication in a distributed Ginkgo solver. This page explains how that operation decomposes into local work and halo exchange, where MPI happens in the call stack, and the stream-MPI ordering rule for GPU executors.

Anatomy of a distributed Matrix×Vector#

The operation y = A * x on distributed objects decomposes into two parts on each rank:

y_local = A_diag     * x_local     ← purely local work
        + A_off_diag * x_halo      ← needs values from other ranks

A_diag is the diagonal matrix: entries whose row is owned by this rank and whose column is also owned by this rank. Sized \(n_\text{local rows} \times n_\text{local cols}\), both axes locally indexed.
A_off_diag is the off-diagonal matrix: entries whose row is owned by this rank but whose column is owned by some other rank. Stored compressed — only the remote columns this rank’s local rows actually reference are kept, and those columns are renumbered \(0..K-1\) via the matrix’s index_map. Sized \(n_\text{local rows} \times K\).
x_halo is the value buffer this rank receives during the halo exchange: one entry per remote column the off-diagonal matrix references, ordered to match the compressed column numbering of A_off_diag.

Both blocks are accessible through the public API: mat->get_diag_matrix() returns the diagonal matrix and mat->get_off_diag_matrix() returns the off-diagonal matrix. The “diagonal” prefix refers to the block structure of the partitioned global matrix — every rank’s diagonal piece sits on the block-diagonal of the permuted global matrix — not to a matrix that is literally diagonal.

The halo pattern (which remote rows this rank needs) is derived from IndexMap at construction time and is fixed for the lifetime of the matrix. It is not recomputed on every apply call.

Halo exchange — gather and scatter#

The communication protocol for a single distributed apply proceeds as follows:

Gather     — each rank packs the x values that its neighbours need into a send buffer.
Exchange   — a non-blocking all-to-all-variable (MPI_Alltoallv via i_all_to_all_v) transfers the packed values.
Diagonal   — the diagonal matrix apply runs in parallel with the exchange.
Wait       — complete the MPI operation, ensuring x_halo is populated.
Off-diag   — apply the off-diagonal matrix using the received x_halo values.
Sum        — y_local = diagonal-matrix result + off-diagonal-matrix result.

The exchange is a single non-blocking variable-sized all-to-all (mpi::request i_all_to_all_v(...) from the collective communicator), not a sequence of point-to-point MPI_Isend / MPI_Irecv pairs. This lets MPI implementations pick the optimal underlying transport — including direct neighbour-to-neighbour sends when the communicator topology makes that natural — without Ginkgo having to enumerate the send/receive list manually.

In Ginkgo, the gather step is implemented by a RowGatherer. The scatter (step 5) uses the per-Vector halo buffer. Both objects are set up once during matrix construction and reused across all subsequent apply calls, so the bookkeeping cost is amortised.

Where MPI happens#

MPI calls are confined to the core distributed layer (core/distributed/...). Backend-specific kernel files (cuda/, hip/, dpcpp/) never include mpi.h. This separation is a deliberate maintainability constraint:

The distributed Matrix::apply starts the diagonal-matrix kernel on the executor.
The RowGatherer packs send values into a contiguous buffer (host buffer for standard MPI; a device-accessible buffer for CUDA-aware MPI).
The collective communicator issues an i_all_to_all_v to exchange the packed values.
Once the receive completes, the off-diagonal-matrix kernel fires on the executor.

GPU kernels never depend on MPI, and host-side MPI code never touches CUDA or HIP headers. The consequence is that adding a new GPU backend does not require touching the distributed communication layer at all.

Stream-MPI ordering for GPU executors#

When using a GPU executor with a non-GPU-aware MPI, the gather kernel that populates the send buffer runs on the executor’s device stream while the MPI call runs on the host. The kernel must finish before the host-side MPI call reads the staged buffer. Without an explicit synchronisation point, the kernel may still be executing on the device stream when MPI starts reading.

The straightforward way to enforce the ordering in user code is a full executor sync after the kernel and before the MPI call:

exec->synchronize();
// Now safe to launch the host-side i_all_to_all_v on the gathered buffer.

This waits for everything queued on the executor’s stream, not just the gather kernel. Internally Ginkgo’s RowGatherer does something more granular — it records an event on the stream right after the gather kernel and waits on just that event before issuing the all-to-all — but the event-recording machinery (gko::detail::Event, the event::record_event kernel registered in core/distributed/row_gatherer.cpp) is internal-by-convention and not a stable user-facing API. For an external distributed LinOp, the executor-level sync is the supported path.

With a GPU-aware MPI build (-DGINKGO_FORCE_GPU_AWARE_MPI=ON at configure time), the device pointer can be passed directly to the all-to-all call without any host-side synchronisation — the CUDA / HIP stream semantics still apply, but MPI manages the ordering.

Tip

Most user code does not orchestrate halo exchange manually. Matrix::apply handles all of the above internally. The ordering details matter for users implementing custom distributed LinOp operations that perform their own MPI communication.

A diagram of distributed apply#

Rank 0                              Rank 1
──────────────────────────          ──────────────────────────
A_diag * x_local (kernel)            A_diag * x_local (kernel)
         |                                   |
         |  halo exchange (MPI_Alltoallv)    |
         |  send boundary x values ────────> |
         | <──────────────────────  send boundary x values
         |                                   |
A_off_diag * x_halo (kernel)         A_off_diag * x_halo (kernel)
         |                                   |
         v                                   v
      y_local                             y_local

The diagonal-matrix and off-diagonal-matrix kernels can overlap with the MPI exchange in time — the diagonal-matrix result does not depend on halo data, so both sides start immediately while the exchange is in flight.