Sellp (Slice-ELLPACK)#

Sliced ELLPACK is ELL’s answer to non-uniform row lengths. Instead of padding every row to the globally longest one, Sellp partitions the matrix into fixed-size slices of contiguous rows and pads each slice to its own longest row. A single outlier no longer inflates the storage of every other row in the matrix — only the slice it lives in. The default slice size matches a GPU warp, which keeps memory access coalesced within a slice while shrinking padding overall.

Storage layout#

The rows are grouped into slices of size \(S\) (default 32). For slice \(s\) spanning rows \([sS,\, (s+1)S)\), let \(k_s\) be the longest non-zero count in that slice. Sellp stores, per slice, two column-major arrays of size \(S \times k_s\) — the values and column indices of every entry, padded so all rows in the slice present \(k_s\) slots. A slice_sets prefix-sum array records each slice’s offset in slot-column units — slice \(s\) occupies the flat range \([\mathit{slice\_sets}[s] \cdot S,\, \mathit{slice\_sets}[s+1] \cdot S)\) of the values / col-indices buffers — and a slice_lengths array records each \(k_s\).

Matrix (n_rows = 4):       Two slices of S = 2:
[ 1  2  0 ]                Slice 0 covers rows [0, 1]:  k_0 = 2
[ 0  3  0 ]                  values    = [1, 3,  2, 0]
[ 4  0  5 ]                  col_idxs  = [0, 1,  1, *]
[ 0  6  0 ]                Slice 1 covers rows [2, 3]:  k_1 = 2
                             values    = [4, 6,  5, 0]
                             col_idxs  = [0, 1,  2, *]

The slice_sets array would be [0, 2, 4] (slice 0 spans slot-columns 0..2, slice 1 spans 2..4 — multiplied by \(S=2\) this gives flat extents 0..4 and 4..8 in the values / col-indices buffers), and slice_lengths would be [2, 2]. (* = padding: value = 0, col_idx = invalid_index().)

When to use Sellp#

Non-uniform but locally similar row lengths — typical of unstructured finite-element matrices where row degree varies across the mesh but is similar between adjacent rows after a band-reducing reordering (see RCM).
GPU SpMV where ELL would waste too much space. Each slice still gives coalesced access within itself; only the slices that need it grow wide.
As a fall-back from ELL when row-length variance becomes too costly, before reaching for the heavier Hybrid split.

The slice size \(S\) is configurable; matching the GPU warp width (32 on NVIDIA, 64 on AMD) is the standard choice, but smaller slices reduce padding at the cost of slightly worse coalescing.

Construction#

// From matrix_data — slice_size defaults to 32
auto sellp = gko::matrix::Sellp<double, int>::create(exec);
sellp->read(matrix_data);

// With an explicit slice size (and slice-stride factor for alignment)
auto sellp = gko::matrix::Sellp<double, int>::create(
                 exec,
                 gko::dim<2>{n_rows, n_cols},
                 /*slice_size=*/ 32,
                 /*stride_factor=*/ 1,
                 /*total_cols=*/ 0);   // 0 = auto

// Converted from CSR
auto sellp = gko::matrix::Sellp<double, int>::create(exec);
sellp->copy_from(csr);

A re-ordering step (e.g. RCM) before conversion can dramatically reduce the per-slice maxima and thus the storage overhead.

Performance characteristics#

Sellp’s per-output-row SpMV cost is \(O(k_s)\) memory traffic for the slice that row belongs to. Compared with ELL’s globally-uniform \(k\), the average traffic drops to roughly \(\sum_s k_s / n_\text{slices} \times S\) — proportional to the slice-wise maximum rather than the global one.

The break-even versus ELL is roughly: if the slice-wise maxima are within a small factor of the average row length, Sellp shines. If the row-length distribution is essentially flat, plain ELL’s simpler kernel is usually a hair faster. If the distribution has a hard heavy tail (one or two rows with thousands of entries while the rest have a handful), Sellp’s slice padding can still amplify; reach for Hybrid instead.

Conversion behaviour#

Sellp converts to and from CSR, ELL, and COO:

// Sellp → CSR
auto csr = gko::matrix::Csr<double, int>::create(exec);
csr->copy_from(sellp);

// CSR → Sellp
auto sellp = gko::matrix::Sellp<double, int>::create(exec);
sellp->copy_from(csr);

copy_from(csr) re-slices the matrix using the current slice size and stride factor; if those have not been set, the defaults are used.