-
Notifications
You must be signed in to change notification settings - Fork 81
This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parallelizing the assembly over the elements #450
Comments
If you want to optimize for performance, I would suggest profiling the memory usage of a single thread assembly first. There might be some easy ways to get the code faster. |
Could be, but I'm pretty happy with the single thread performance for my own use cases, and was only surprised in #439 by the amount of overhead generated by evaluating a complicated form. I've been previously doing plenty of profiling outside of the form evaluation parts of the code and I'm quite convinced that there isn't massive gains (in terms of processor time) there. Especially now that almost everything gets precomputed in Everything that happens within a form is basically multiplication and summation of large 2D NumPy arrays, and I don't know how much we're really able to improve that because the forms are provided by the user. On the otherhand, in terms of memory usage there could be plenty of improvements. Edit: I'm happy to be proven wrong though :-) |
I mean, comparing the numbers in the README for Laplace equation and the numbers for the Jacobian for Neohookean hyperelasticity (both evaluated by my laptop), it's obvious that the overhead comes from the form, especially since same element is used in both cases. One option would be to try to JIT compile the entire form using Numba or similar. |
I tried to JIT compile the above form using Numba:
So 4x faster. But the form is 100x more ugly: @jit(nopython=True)
def numbajac(du, dv, dw):
out = np.zeros_like(du[0, 0])
for a in range(dw.shape[2]):
for b in range(dw.shape[3]):
dw[0, 0, a, b] += 1
dw[1, 1, a, b] += 1
dw[2, 2, a, b] += 1
J = np.zeros_like(out)
for a in range(J.shape[0]):
for b in range(J.shape[1]):
J[a,b] += dw[0, 0, a, b] * (dw[1, 1, a, b] * dw[2, 2, a, b] -
dw[1, 2, a, b] * dw[2, 1, a, b]) -\
dw[0, 1, a, b] * (dw[1, 0, a, b] * dw[2, 2, a, b] -
dw[1, 2, a, b] * dw[2, 0, a, b]) +\
dw[0, 2, a, b] * (dw[1, 0,a ,b] * dw[2, 1, a, b] -
dw[1, 1, a, b] * dw[2, 0, a, b])
Finv = np.zeros_like(dw)
for a in range(J.shape[0]):
for b in range(J.shape[1]):
Finv[0, 0, a, b] += (-dw[1, 2, a, b] * dw[2, 1, a, b] +
dw[1, 1, a, b] * dw[2, 2, a, b]) / J[a, b]
Finv[1, 0, a, b] += (dw[1, 2, a, b] * dw[2, 0, a, b] -
dw[1, 0, a, b] * dw[2, 2, a, b]) / J[a, b]
Finv[2, 0, a, b] += (-dw[1, 1, a, b] * dw[2, 0, a, b] +
dw[1, 0, a, b] * dw[2, 1, a, b]) / J[a, b]
Finv[0, 1, a, b] += (dw[0, 2, a, b] * dw[2, 1, a, b] -
dw[0, 1, a, b] * dw[2, 2, a, b]) / J[a, b]
Finv[1, 1, a, b] += (-dw[0, 2, a, b] * dw[2, 0, a, b] +
dw[0, 0, a, b] * dw[2, 2, a, b]) / J[a, b]
Finv[2, 1, a, b] += (dw[0, 1, a, b] * dw[2, 0, a, b] -
dw[0, 0, a, b] * dw[2, 1, a, b]) / J[a, b]
Finv[0, 2, a, b] += (-dw[0, 2, a, b] * dw[1, 1, a, b] +
dw[0, 1, a, b] * dw[1, 2, a, b]) / J[a, b]
Finv[1, 2, a, b] += (dw[0, 2, a, b] * dw[1, 0, a, b] -
dw[0, 0, a, b] * dw[1, 2, a, b]) / J[a, b]
Finv[2, 2, a, b] += (-dw[0, 1, a, b] * dw[1, 0, a, b] +
dw[0, 0, a, b] * dw[1, 1, a, b]) / J[a, b]
for i in range(du.shape[0]):
for j in range(du.shape[0]):
for k in range(du.shape[0]):
for l in range(du.shape[0]):
for a in range(J.shape[0]):
for b in range(J.shape[1]):
out[a, b] += ((mu + lmbda * J[a, b] * (J[a, b] - 1)) * Finv[i, j, a, b] * Finv[k, l, a, b] +\
lmbda * (2 * J[a, b] - 1) * J[a, b] * Finv[j,i,a,b] * Finv[l,k,a,b])\
* du[i, j, a, b] * dv[k, l, a, b]
return out
@BilinearForm
def jac2(u, v, w):
return numbajac(*(grad(u), grad(v), grad(w['w']))) I suppose it will be even faster if entire |
@ahojukka5 is correct to point out that one should always profile before attempting to optimize, but there is another aspect to the general issue of ‘parallelizing the assembly over the elements’. Although optimization seems to have been in mind here, besides the direct use of scikit-fem as a finite element assembler, another of its primary purposes is ‘a focus on computational experimentation’ (paper.md) and much contemporary research in finite element methodology does concern parallelization, so it might be worth getting some technniques and idioms in place for parallelizing the assembly over the elements, regardless of whether it's faster or slower. For that, I think
Actually following on from above, I'm thinking about not combining the resulting matrices at all, but leaving them as separate operators in a domain decomposition setting. i have to attend to other matters now, but do hope to return to this, probably using On speeding up the assembly of hyperelastic forms #439, I'm puzzled that using |
I had been thinking about this for a while but a recent additional motivation came from Knoll & Keyes (2004, §3.2 ‘Newton–Krylov–Schwarz’):
|
@kinnala |
I actually tried this and the gain was insignificant, something like 20% max. improvement to running time. The idea would have been to inline and JIT all loops starting from this: https://github.com/kinnala/scikit-fem/blob/master/skfem/assembly/form/bilinear_form.py#L63 But it requires some major refactoring of different pieces and the gain was too small for the added complexity. |
I see. Thanks a lot!! It does seem that JIT-ing the form is pretty much it as far as speeding up the code, right? |
Yes I think so, unless you want to try what's suggested in the title of the issue and the first post, i.e. parallelize over elements. That should end up being multiple times faster if done properly. |
I see. I will try experimenting with dask and explore how the individual matrices could be combined and so on.. |
My guess is that if parallelization over elements is to be performed, combining the results should be done before a call to |
I found out that with Just modify the file # loop over the indices of local stiffness matrix
for j in range(ubasis.Nbfun):
for i in range(vbasis.Nbfun):
ixs = slice(nt * (vbasis.Nbfun * j + i),
nt * (vbasis.Nbfun * j + i + 1))
rows[ixs] = vbasis.element_dofs[i]
cols[ixs] = ubasis.element_dofs[j]
data[ixs] = self._kernel(
ubasis.basis[j],
vbasis.basis[i],
wdict,
dx,
) with from joblib import Parallel, delayed
ixs_list = []
bij_list = []
# loop over the indices of local stiffness matrix (pre-loop)
for j in range(ubasis.Nbfun):
for i in range(vbasis.Nbfun):
bij_list.append([ubasis.basis[j], vbasis.basis[i]])
ixs = slice(nt * (vbasis.Nbfun * j + i),
nt * (vbasis.Nbfun * j + i + 1))
ixs_list.append(ixs)
rows[ixs] = vbasis.element_dofs[i]
cols[ixs] = ubasis.element_dofs[j]
data_list = Parallel(n_jobs=-1, prefer="threads")(
delayed(self._kernel)(*bij, wdict, dx) for bij in bij_list)
for ixs, d in zip(ixs_list, data_list):
data[ixs] = d I get about a 2x speed-up of my three-field hyperelasticity example from #616 in combination with a 3x refined hex-mesh. What do you think about that? |
Cool, I think we used to have something like that but the speed improvement was visible only if GIL was circumvented, e.g., using numba |
Here it is
|
Is it the same thing here? https://joblib.readthedocs.io/en/latest/parallel.html#thread-based-parallelism-vs-process-based-parallelism |
We originally removed the implementation using Edit: But I think we could add now an optional flag @BilinearForm(parallel_kernel=True)
def bilinf(u, v, w):
pass to enable this kind of behaviour. |
Yes, as far as I know joblib threading is the same as your original threading implementation. Optional flag sounds good! Errors could be checked without parallelization and then the flag could be turned on. I played with joblib some time ago and the best speed up can be achieved only by using the default backend. But things are getting complicated if the code uses custom classes (pickle errors)... |
Started this work in #625, check it out if you have time. |
Thanks @kinnala @adtzlr for the work. Just happen to have gotten back to this. For the example in #439, this is what I get for the assembly times on my laptop (haven't done a detailed timing analysis). A import numpy as np
from skfem.helpers import grad, dot
from numba import jit
from skfem import *
@jit(nopython=True, nogil=True)
def vdet(F):
J = np.zeros_like(F[0, 0])
for a in range(J.shape[0]):
for b in range(J.shape[1]):
J[a, b] += (
F[0, 0, a, b]
* (F[1, 1, a, b] * F[2, 2, a, b] - F[1, 2, a, b] * F[2, 1, a, b])
- F[0, 1, a, b]
* (F[1, 0, a, b] * F[2, 2, a, b] - F[1, 2, a, b] * F[2, 0, a, b])
+ F[0, 2, a, b]
* (F[1, 0, a, b] * F[2, 1, a, b] - F[1, 1, a, b] * F[2, 0, a, b])
)
return J
@jit(nopython=True, nogil=True)
def vinv(F):
J = vdet(F)
Finv = np.zeros_like(F)
for a in range(J.shape[0]):
for b in range(J.shape[1]):
Finv[0, 0, a, b] += (
-F[1, 2, a, b] * F[2, 1, a, b] + F[1, 1, a, b] * F[2, 2, a, b]
) / J[a, b]
Finv[1, 0, a, b] += (
F[1, 2, a, b] * F[2, 0, a, b] - F[1, 0, a, b] * F[2, 2, a, b]
) / J[a, b]
Finv[2, 0, a, b] += (
-F[1, 1, a, b] * F[2, 0, a, b] + F[1, 0, a, b] * F[2, 1, a, b]
) / J[a, b]
Finv[0, 1, a, b] += (
F[0, 2, a, b] * F[2, 1, a, b] - F[0, 1, a, b] * F[2, 2, a, b]
) / J[a, b]
Finv[1, 1, a, b] += (
-F[0, 2, a, b] * F[2, 0, a, b] + F[0, 0, a, b] * F[2, 2, a, b]
) / J[a, b]
Finv[2, 1, a, b] += (
F[0, 1, a, b] * F[2, 0, a, b] - F[0, 0, a, b] * F[2, 1, a, b]
) / J[a, b]
Finv[0, 2, a, b] += (
-F[0, 2, a, b] * F[1, 1, a, b] + F[0, 1, a, b] * F[1, 2, a, b]
) / J[a, b]
Finv[1, 2, a, b] += (
F[0, 2, a, b] * F[1, 0, a, b] - F[0, 0, a, b] * F[1, 2, a, b]
) / J[a, b]
Finv[2, 2, a, b] += (
-F[0, 1, a, b] * F[1, 0, a, b] + F[0, 0, a, b] * F[1, 1, a, b]
) / J[a, b]
return Finv
@jit(nopython=True, nogil=True)
def numbares(dv, dw):
out = np.zeros_like(dv[0, 0])
F = np.zeros_like(dw)
for a in range(dw.shape[2]):
for b in range(dw.shape[3]):
F[0, 0, a, b] += 1.0
F[1, 1, a, b] += 1.0
F[2, 2, a, b] += 1.0
F += dw
J = vdet(F)
Finv = vinv(F)
for i in range(dv.shape[0]):
for j in range(dv.shape[0]):
for a in range(J.shape[0]):
for b in range(J.shape[1]):
out[a, b] += (
mu * F[i, j, a, b]
- mu * Finv[j, i, a, b]
+ lmbda * J[a, b] * (J[a, b] - 1.0) * Finv[j, i, a, b]
) * dv[i, j, a, b]
return out
@jit(nopython=True, nogil=True)
def numbajac(du, dv, dw):
out = np.zeros_like(du[0, 0])
F = np.zeros_like(dw)
for a in range(dw.shape[2]):
for b in range(dw.shape[3]):
F[0, 0, a, b] += 1.0
F[1, 1, a, b] += 1.0
F[2, 2, a, b] += 1.0
F += dw
J = vdet(F)
Finv = vinv(F)
kron = np.eye(dw.shape[0])
for i in range(du.shape[0]):
for j in range(du.shape[0]):
for k in range(du.shape[0]):
for l in range(du.shape[0]):
for a in range(J.shape[0]):
for b in range(J.shape[1]):
out[a, b] += (
(
mu * kron[i, k] * kron[j, l]
+ (mu - lmbda * J[a, b] * (J[a, b] - 1))
* Finv[j, k, a, b]
* Finv[l, i, a, b]
+ lmbda
* (2 * J[a, b] - 1)
* J[a, b]
* Finv[j, i, a, b]
* Finv[l, k, a, b]
)
* du[i, j, a, b]
* dv[k, l, a, b]
)
return out
@jit(nopython=True, nogil=True)
def numbaEnergy(dw):
out = np.zeros_like(dw[0, 0])
F = np.zeros_like(dw)
for a in range(dw.shape[2]):
for b in range(dw.shape[3]):
F[0, 0, a, b] += 1.0
F[1, 1, a, b] += 1.0
F[2, 2, a, b] += 1.0
F += dw
J = vdet(F)
I1 = np.zeros_like(dw[0, 0])
for a in range(J.shape[0]):
for b in range(J.shape[1]):
for i in range(dw.shape[0]):
for j in range(dw.shape[1]):
I1[a, b] += F[i, j, a, b] * F[i, j, a, b]
for a in range(J.shape[0]):
for b in range(J.shape[1]):
out[a, b] = (
mu / 2.0 * (I1[a, b] - 3.0)
- mu * np.log(J[a, b])
+ lmbda / 2.0 * (J[a, b] - 1.0) ** 2
)
return out
mesh = MeshTet().refined(4)
elem = ElementTetP1()
uelem = ElementVectorH1(elem)
iBasis = InteriorBasis(mesh, uelem, intorder=2)
fBasis = FacetBasis(mesh, uelem, intorder=2)
u = np.zeros(iBasis.N)
E, nu = 10.0, 0.3
mu = E / 2 / (1 + nu)
lmbda = 2 * mu * nu / (1 - 2 * nu)
@LinearForm(nthreads=17)
def rhs(v, w):
return -dot(bodyForce, v)
@LinearForm(nthreads=17)
def rhsSurf(v, w):
return -dot(surfaceTraction, v) * (restBoundary(w.x))
@BilinearForm(nthreads=17)
def jac2(u, v, w):
return numbajac(*(grad(u), grad(v), grad(w["w"])))
@LinearForm(nthreads=17)
def res2(v, w):
return numbares(*(grad(v), grad(w["w"])))
@Functional(nthreads=17)
def energy(w):
return numbaEnergy(grad(w["w"]))
w = iBasis.interpolate(u)
jac2.assemble(iBasis, w=w) Timings6.78 s ± 151 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # nthreads = 0
3.49 s ± 305 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # nthreads = 3
3.03 s ± 197 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # nthreads = 9
3.18 s ± 283 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # nthreads = 17 |
This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
I'm investigating the feasibility of parallelizing assembly over elements.
Take the following snippet (from #439 ):
Assembly takes quite a while because so many FP operations are done inside the form:
This is despite there being only about 4k DOF's (Edit: 3 * 4k = 12 k DOF's)
What happens if we assemble only half of the elements?
Looking at
htop
while this is running, we see that assembly (in this case) is done mostly using single core.So we could try assembling ib1 and ib2 in parallel and save some time.
Let's try this using dask.
12.6 s < 16.4 s
So we seem to have a chance of saving some seconds by splitting the assembly over elements.
Remaining questions:
scipy.sparse
routines.I'll make a branch which explores this when I have time.
The text was updated successfully, but these errors were encountered: