Direct (LU / Cholesky)#

gko::experimental::solver::Direct is Ginkgo’s sparse direct solver. It computes a sparse LU or Cholesky factorisation once at generate() time and reuses the factors across apply() calls. The factorisation lives in two LinOps — a lower and an upper triangular solver — so the cost of a solve is two triangular sweeps, not a re-factorisation.

Construction#

auto factory = gko::experimental::solver::Direct<double, gko::int32>::build()
                   .with_factorization(
                       gko::experimental::factorization::Lu<double, gko::int32>::build()
                           .on(exec))
                   .on(exec);

auto solver = factory->generate(system_matrix);
solver->apply(b, x);

The with_factorization parameter accepts any factorisation factory whose output is a gko::experimental::factorization::Factorization. In practice that means:

  • gko::experimental::factorization::Lu — general unsymmetric matrices.

  • gko::experimental::factorization::Cholesky — symmetric positive-definite matrices.

Both factories produce a sparse factorisation, not a dense one. The cost and memory footprint depend on the matrix’s sparsity and the row/column ordering.

Reorder before factorising#

The single biggest lever for performance and memory of a sparse direct solver is the row/column ordering. Without a fill-reducing reordering, the L and U factors of a typical unstructured matrix can be many times larger than the original matrix; with a good reordering they often stay within a small constant factor.

The recommended preprocessing recipe is:

  1. Match (optional): apply Mc64 to push large entries onto the diagonal, improving pivoting stability.

  2. Reorder: apply Amd (default for small / medium matrices) or NestedDissection (for large structured systems; requires METIS) to reduce fill-in. As an alternative — particularly when the triangular solves are bandwidth-bound on CPU — Rcm minimises bandwidth/profile instead of fill; it can produce different (sometimes better) cache behaviour than the fill-reducers but is not a fill-reducing reordering itself.

  3. Factorise with Direct on the permuted matrix.

  4. Solve with apply(b, x).

The ScaledReordered linear operator bundles steps 1–3 around an inner solver, so the caller passes the original system matrix and right-hand side and receives the permuted result transparently. See Reordering and permutations for the full taxonomy of algorithms and worked examples.

Triangular solves under the hood#

Internally, Direct stores the factors as a pair of triangular LinOps:

  • gko::solver::LowerTrs for L y = b

  • gko::solver::UpperTrs for U x = y

These two solvers can also be used on their own. If you already have L and U from an external factorisation library and only need to apply the triangular sweeps, construct LowerTrs and UpperTrs directly from the CSR matrices.

On the CUDA backend, both triangular solvers expose two algorithm choices via with_algorithm:

  • trisolve_algorithm::sparselib — cuSPARSE’s triangular solve (default).

  • trisolve_algorithm::syncfree — Ginkgo’s own sync-free implementation, often faster for highly parallel triangular structures.

On the other backends the parameter is part of the API for uniformity but does not change which implementation runs — Reference, OMP, HIP, and DPC++ each ship a single triangular-solve kernel, and the kernel does not branch on algorithm.

cuDSS on NVIDIA GPUs#

For CUDA workloads where Ginkgo’s own factorisation is the bottleneck — typically large unstructured systems — the gko::ext::cuda::solver::Cudss extension wraps NVIDIA’s cuDSS sparse direct solver. cuDSS supports unsymmetric and symmetric factorisations, applies its own reordering and pivoting heuristics, and exposes a refactorize entry point that reuses both the symbolic analysis and the factorization data structures when only the numerical values change. The extension is built automatically when CUDA support is enabled and CMake discovers a cuDSS installation — see Build-time enablement for details.

When sparse direct is (and is not) appropriate#

Sparse direct solvers can handle large systems when paired with a fill-reducing reordering — that is exactly what production sparse-direct packages (SuperLU, MUMPS, cuDSS) do for million-degree-of-freedom problems. They are particularly attractive when:

  • The same matrix is used for many right-hand sides — the factorisation is amortised.

  • Numerical robustness matters more than peak iteration speed.

  • The sparsity pattern is fixed across many solves with changing values — passing a precomputed symbolic_factorization to the Lu / Cholesky factory lets subsequent generate() calls skip the symbolic step. On CUDA, the cuDSS extension goes further with an in-place refactorize that updates only the numerical values.

The classical caveats:

  • Fill-in. Without a good reordering, factorisation memory and time grow super-linearly. For very large unstructured matrices, even a well-reordered factorisation may not fit in memory.

  • Single-precision wins are limited. Direct solvers are bandwidth- and flop-bound on the factors; the speedup from low-precision arithmetic is usually smaller than for Krylov methods.

  • Cross-backend coverage varies. The Ginkgo core implementation works on every backend, but vendor-tuned alternatives (cuDSS) are CUDA-only.

When the system is too large for a direct factorisation, switch to a Krylov solver with an ILU/IC or AMG preconditioner — see Solvers — taxonomy.

Publications#

See also