Dense communicator#

gko::experimental::mpi::DenseCommunicator is the default CollectiveCommunicator implementation. It performs the halo exchange with MPI_Ialltoallv over the full rank set: every rank passes the entire per-rank send size and offset vector to MPI, and ranks that exchange zero bytes simply have a zero entry. This is the safe, portable default — it works on every MPI 3.x and 4.x implementation and does not require the platform’s MPI to expose distributed-graph topology support.

When to use it#

  • You are not running at extreme rank counts. Up to a few thousand ranks, the constant-factor overhead of MPI_Ialltoallv over the full rank set is negligible compared to the actual data transfer.

  • You want maximum portability. Some MPI builds disable or under-implement MPI_Ineighbor_alltoallv (the alternative); the dense path is always available.

  • You do not yet know your halo connectivity is sparse. Until profiling shows the all-to-all is dominating, the dense pattern is a perfectly reasonable choice.

For high rank counts with sparse halo connectivity (typical FE / FD meshes), prefer NeighborhoodCommunicator — it lets MPI skip the zero-size entries entirely.

Construction#

The expected pattern is to derive the communicator from an IndexMap, which already encodes which remote ranks own which indices this rank needs:

namespace mpi = gko::experimental::mpi;

auto base_comm = mpi::communicator(MPI_COMM_WORLD);

auto coll_comm = std::make_shared<mpi::DenseCommunicator>(base_comm, imap);

The constructor reads the index map’s remote_target_ids and remote_global_idxs to compute the recv-side per-rank sizes, then performs an MPI_Alltoall over the size vector to discover the send-side sizes. After this point the communicator is “primed” — every subsequent i_all_to_all_v(...) call uses the cached size and offset vectors.

A null-pattern constructor is also available — DenseCommunicator(base_comm) builds a communicator with empty send and receive sets. It exists mainly so that distributed types can be default-constructed before their pattern is known.

What it stores#

Internally the dense communicator caches four std::vector<comm_index_type> per rank:

  • send_sizes_ and send_offsets_ — how many values this rank sends to each peer rank, and the start index of that block in the send buffer.

  • recv_sizes_ and recv_offsets_ — the receive-side equivalents.

get_send_size() and get_recv_size() return the cumulative totals (the back element of the respective offsets vector), which is the exact buffer size each side needs to allocate. The matrix’s RowGatherer reads these to size the contiguous send / receive buffers it manages.

The exchange#

auto req = coll_comm->i_all_to_all_v(exec, send_ptr, recv_ptr);
// ... overlap work that does not touch recv_ptr ...
req.wait();

On modern MPI (≥ 4.1) the underlying primitive is MPI_Ialltoallv — a non-blocking variable-size all-to-all. On Open MPI builds older than 4.1 the implementation falls back to the blocking MPI_Alltoallv (since older Open MPI did not implement the non-blocking form correctly for all type combinations); the returned mpi::request is a default-constructed handle that completes immediately. From the caller’s perspective, wait() is always the right thing to call.

Cost model#

For \(P\) ranks, the per-call work is:

  • \(\Theta(P)\) to scan the size and offset vectors on each rank (small, typically cache-resident).

  • \(\Theta(N_\text{send})\) to actually transfer the values, where \(N_\text{send}\) is the total number of values this rank sends.

The MPI_Ialltoallv collective itself is internally implemented by most MPI runtimes as either a direct exchange between active senders and receivers (fast for sparse patterns) or a reduction tree (better for dense patterns). The Ginkgo side does not control the choice — it just supplies the per-rank size vector and lets MPI pick.

See also