I often have to deal with nonlinear shape optimization problems involving curves or surfaces. A classical example is Plateau's problem.
No matter whether I use gradient-based techniques (such as mean curvature flow or $H^1$-gradient flow) or Newton's method, I find myself having to assemble a relatively large SparseArray in each iteration (in order to solve a system of linear equations afterwards). The point is, the sparsity pattern of these matrices is always the same; it is just the nonzero values that change. The same occurs if one has to solve, e.g., a nonlinear elliptic PDE or a parabolic PDE with time-dependent or state-dependent coefficients (see here for such an example).
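A tiny toy sketch of what I mean (pat0 is a hand-made pattern, unrelated to the mesh example below): two assemblies that differ only in their values yield SparseArrays with identical structural data.

    pat0 = {{1, 1}, {1, 2}, {2, 1}, {2, 2}, {2, 3}, {3, 2}, {3, 3}};
    A1 = SparseArray[pat0 -> RandomReal[1, Length[pat0]], {3, 3}, 0.];
    A2 = SparseArray[pat0 -> RandomReal[1, Length[pat0]], {3, 3}, 0.];
    (* identical row pointers and column indices; only the nonzero values differ *)
    A1["RowPointers"] === A2["RowPointers"] && A1["ColumnIndices"] === A2["ColumnIndices"]

    True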
Here is a typical example: assembling the discrete Laplace-Beltrami operator on a triangle mesh. We start with two functions, one computing the values and one computing the positions at which they have to be added into the final matrix.
    getLaplacian = Quiet@Block[{xx, x, PP, P, UU, U, f, Df, u, Du, v, Dv, g, integrant, quadraturepoints, quadratureweights},
       xx = Table[Part[x, i], {i, 1, 2}];
       PP = Table[Part[P, i, j], {i, 1, 3}, {j, 1, 3}];
       UU = Table[Part[U, i], {i, 1, 3}];
       f = x \[Function] PP[[1]] + x[[1]] (PP[[2]] - PP[[1]]) + x[[2]] (PP[[3]] - PP[[1]]);
       Df = x \[Function] Evaluate[D[f[xx], {xx}]];
       g = x \[Function] Evaluate[Df[xx]\[Transpose].Df[xx]];
       u = x \[Function] UU[[1]] + x[[1]] (UU[[2]] - UU[[1]]) + x[[2]] (UU[[3]] - UU[[1]]);
       Du = x \[Function] Evaluate[D[u[xx], {xx}]];
       integrant = x \[Function] Evaluate[D[Du[xx].Inverse[g[xx]].Du[xx] Sqrt[Abs[Det[g[xx]]]], {UU, 2}]];
       quadraturepoints = {{1/3, 1/3}};
       quadratureweights = {1/2};
       With[{code = N[quadratureweights.Map[integrant, quadraturepoints]] /. Part -> Compile`GetElement},
        Compile[{{P, _Real, 2}},
         code,
         CompilationTarget -> "C",
         RuntimeAttributes -> {Listable},
         Parallelization -> True,
         RuntimeOptions -> "Speed"
         ]
        ]
       ];

    getLaplacianCombinatorics = Block[{ff},
       With[{code = Flatten[Table[{Compile`GetElement[ff, i], Compile`GetElement[ff, j]}, {j, 1, 3}, {i, 1, 3}], 1]},
        Compile[{{ff, _Integer, 1}},
         code,
         CompilationTarget -> "C",
         RuntimeAttributes -> {Listable},
         Parallelization -> True,
         RuntimeOptions -> "Speed"
         ]
        ]
       ];
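Just to illustrate what these two compiled functions return (this check is not part of the actual pipeline): applied to a single triangle, getLaplacian produces the 3×3 local stiffness matrix, and getLaplacianCombinatorics the nine index pairs into which its entries have to be scattered (shown here for a triangle with arbitrary vertex indices 17, 5, 23). For the unit right triangle, the result should be twice the familiar cotangent weights; the factor 2 stems from taking the Hessian of the quadratic form. Both functions are Listable, so they also map over whole lists of triangles, which is how they are used below.

    getLaplacian[{{0., 0., 0.}, {1., 0., 0.}, {0., 1., 0.}}]
    (* expected: {{2., -1., -1.}, {-1., 1., 0.}, {-1., 0., 1.}} *)

    getLaplacianCombinatorics[{17, 5, 23}]
    (* {{17, 17}, {5, 17}, {23, 17}, {17, 5}, {5, 5}, {23, 5}, {17, 23}, {5, 23}, {23, 23}} *)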
Now, let's create a very fine triangle mesh on my beloved "Triceratops".

    R = ExampleData[{"Geometry3D", "Triceratops"}, "MeshRegion"];
    R = DiscretizeRegion[R, MaxCellMeasure -> {1 -> 0.01}];
    MeshCellCount[R, 0]
    MeshCellCount[R, 2]

    666191

    1332378
And this is how the matrix gets assembled:
    tuples = MeshCells[R, 2, "Multicells" -> True][[1, 1]];

    pat = Flatten[getLaplacianCombinatorics[tuples], 1]; // RepeatedTiming // First

    0.21

    vals = Flatten[getLaplacian[Partition[MeshCoordinates[R][[Flatten[tuples]]], 3]]]; // RepeatedTiming // First

    0.20

    A = With[{spopt = SystemOptions["SparseArrayOptions"]},
       Internal`WithLocalSettings[
        SetSystemOptions["SparseArrayOptions" -> {"TreatRepeatedEntries" -> Total}],
        SparseArray[pat -> vals, {MeshCellCount[R, 0], MeshCellCount[R, 0]}, 0.],
        SetSystemOptions[spopt]
        ]
       ]; // RepeatedTiming // First

    0.72
What catches the eye is that the mere assembly of the matrix takes several times longer than the actual number crunching in getLaplacian. Now suppose the surface has moved slightly during the optimization process, so that the MeshCoordinates have changed while all the combinatorics (the MeshCells) stay the same. That means we can reuse pat, but we have to (1) recompute vals and (2) reassemble the matrix A.
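So the step that recurs in every iteration of the optimization loop looks roughly like this (a sketch; newCoords is a stand-in for the updated vertex positions, everything else is as defined above):

    newCoords = MeshCoordinates[R]; (* in practice: the coordinates of the moved surface *)
    vals = Flatten[getLaplacian[Partition[newCoords[[Flatten[tuples]]], 3]]];
    A = With[{spopt = SystemOptions["SparseArrayOptions"]},
       Internal`WithLocalSettings[
        SetSystemOptions["SparseArrayOptions" -> {"TreatRepeatedEntries" -> Total}],
        SparseArray[pat -> vals, {MeshCellCount[R, 0], MeshCellCount[R, 0]}, 0.],
        SetSystemOptions[spopt]
        ]
       ];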
Since timing is often crucial in these applications and the assembly of the matrix accounts for about 80% of the total time, I wonder whether this can be improved.
Comments:

ord = Ordering[pat]; po = pat[[ord]]; vo = vals[[ord]];

How long does LinearSolve then take compared to the assembly?

SparseArray[po -> _, {MeshCellCount[R, 0], MeshCellCount[R, 0]}, 0.]; // RepeatedTiming // First

LinearSolve[A, Method -> "Pardiso"] (for a matrix A of the same size but that has no nullspace) takes about 2.2 seconds. However, when I use MKL Pardiso over LibraryLink and reuse the symbolic factorization, the numerical factorization after refreshing the nonzero values needs only 0.4 seconds, so roughly half as long as the assembly.