Performance tips
Here are a few tips to keep in mind while writing code that uses Fastor:
For fast code you first need to turn on compiler optimisation using the -O2/-O3 compiler flag (-O3 in particular), or /O2 under Visual Studio.
Fastor uses a lot of sanity checks under the hood to verify the validity and consistency of expressions when they are assigned from one to another, specifically in debug mode. Defining NDEBUG with the compiler flag -DNDEBUG (or /DNDEBUG under Visual Studio) is highly recommended if you want fast code. Most build systems and IDEs define it by default in their release configurations.
Fastor can vectorise almost all operations using the CPU's vector (SIMD) instructions. This results in substantial performance improvements, so make sure to activate the SIMD instruction set that your CPU supports through a compiler flag, for instance -msse2/-mavx/-mavx2/-mfma with GCC-compatible compilers or /arch:[SSE2,AVX,AVX2,AVX512] with Visual Studio. In most cases, when compiling with GCC/Clang/Intel, providing the -march=native flag is sufficient.
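One way to confirm that the flags from the tips above actually reached the compiler is to inspect the macros it predefines. The sketch below is not part of Fastor: build_config is a hypothetical helper name, and it relies on NDEBUG plus the __SSE2__, __AVX__, __AVX2__, __AVX512F__, __FMA__ and __OPTIMIZE__ macros as defined by GCC and Clang. MSVC defines only some of these, so the report is less complete there.

```cpp
#include <string>

// Hypothetical helper (not a Fastor API): summarises the build settings
// that are visible through predefined macros.
inline std::string build_config() {
    std::string s;
#if defined(NDEBUG)
    s += "NDEBUG=on";       // debug-only checks are compiled out
#else
    s += "NDEBUG=off";      // debug-only checks are still active
#endif
#if defined(__OPTIMIZE__)   // set by GCC/Clang when optimising; not set by MSVC
    s += " optimised";
#endif
    // Report the widest SIMD instruction set the compiler was allowed to use.
#if defined(__AVX512F__)
    s += " simd=avx512f";
#elif defined(__AVX2__)
    s += " simd=avx2";
#elif defined(__AVX__)
    s += " simd=avx";
#elif defined(__SSE2__) || defined(_M_X64)
    s += " simd=sse2";
#else
    s += " simd=none";
#endif
#if defined(__FMA__)        // set by GCC/Clang with -mfma or a suitable -march
    s += " fma";
#endif
    return s;
}
```

Printing the result of build_config() once at start-up makes it easy to spot a build that silently fell back to debug or scalar settings.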
If your code uses a lot of complex Fastor expressions, it is beneficial to force the compiler to inline your functions aggressively by using the following additional compiler flags:
GCC: -finline-limit=<a high value>
Clang: -mllvm -inline-threshold=<a high value>
Intel: -inline-forceinline
MSVC: /Ob2
Avoid passing Fastor tensors by value to functions as this will make unnecessary and expensive copies. Instead always pass by reference. So if your function looks like this
void foo(Tensor<T,M,N> a) { .... }
you should change it to
void foo(const Tensor<T,M,N> &a) { .... }
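The cost of the by-value signature can be made visible with a small experiment. The sketch below deliberately does not use Fastor: Block is a hypothetical stand-in for a fixed-size tensor whose copy constructor counts how often it runs, and sum_by_value and sum_by_ref are illustrative names.

```cpp
#include <array>
#include <cstddef>

// Stand-in for a fixed-size tensor (NOT Fastor's Tensor); it counts copies.
struct Block {
    std::array<double, 64> data{};

    // Number of copy constructions performed so far.
    static int &copies() {
        static int n = 0;
        return n;
    }

    Block() = default;
    Block(const Block &other) : data(other.data) { ++copies(); }
};

// Pass by value: the caller's object is copied into the parameter.
inline double sum_by_value(Block b) {
    double s = 0.0;
    for (std::size_t i = 0; i < b.data.size(); ++i) s += b.data[i];
    return s;
}

// Pass by const reference: the function reads the caller's object directly.
inline double sum_by_ref(const Block &b) {
    double s = 0.0;
    for (std::size_t i = 0; i < b.data.size(); ++i) s += b.data[i];
    return s;
}
```

Calling sum_by_value with an existing Block increments Block::copies() by one per call, while sum_by_ref leaves the counter unchanged. The same reasoning applies to any fixed-size, stack-allocated tensor type.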
The next tip is not a performance bottleneck but rather something to keep in mind if you want the utmost performance. While working with views, you should try to avoid assigning between tensors of different orders, for instance
Tensor<double,3,4,5> a;
Tensor<double,3,2> b;
// Assigning part of a third order tensor to a second order tensor
b(all,1) = a(all,2,0);
While this syntax is understandably quite convenient (and you should use it), keep in mind that in such cases Fastor has to create two sets of offsets for the two tensors in order to extract the part from one tensor and assign it to another tensor (of a different order). This does not impact the performance much and in most cases you should not expect degraded performance. However, in certain cases the compiler may simply give up on optimising the expression because there is too much work to do. A rather verbose workaround for this is to use the TensorMap feature to map your tensors to the same order first before assigning them. For instance, in the above example you can do
Tensor<double,3,4,5> a;
Tensor<double,3,2> b;
// map/promote b to a third order tensor first - this does not copy b
TensorMap<double,3,2,1> bmap(b);
// Now, assign a to bmap instead - this will also change b
bmap(all,1,0) = a(all,2,0);
In this case Fastor knows that both the left and the right hand sides have the same order and skips unnecessary offset computations.
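To make the offsets in this example concrete, the sketch below writes the same assignment as plain loops over flat storage. It is an illustration only, not Fastor's implementation: it assumes row-major layout, uses std::array in place of Tensor, and the names A345, B32, assign_slice and assign_slice_mapped are hypothetical.

```cpp
#include <array>
#include <cstddef>

// Flat row-major storage standing in for Tensor<double,3,4,5> and
// Tensor<double,3,2> (an assumption made for illustration).
using A345 = std::array<double, 3 * 4 * 5>;
using B32  = std::array<double, 3 * 2>;

// Element-wise effect of b(all,1) = a(all,2,0): the source offset uses the
// strides of a 3x4x5 tensor, the destination offset those of a 3x2 tensor.
inline void assign_slice(B32 &b, const A345 &a) {
    for (std::size_t i = 0; i < 3; ++i) {
        const std::size_t src = i * (4 * 5) + 2 * 5 + 0;  // offset of a(i,2,0)
        const std::size_t dst = i * 2 + 1;                // offset of b(i,1)
        b[dst] = a[src];
    }
}

// The same assignment with b viewed as 3x2x1 through a raw pointer. The
// extra trailing index is always 0 and the trailing extent is 1, so the
// view addresses exactly the memory of b and nothing is copied.
inline void assign_slice_mapped(double *bmap, const A345 &a) {
    for (std::size_t i = 0; i < 3; ++i) {
        const std::size_t src = i * (4 * 5) + 2 * 5 + 0;  // offset of a(i,2,0)
        const std::size_t dst = (i * 2 + 1) * 1 + 0;      // offset of bmap(i,1,0)
        bmap[dst] = a[src];
    }
}
```

Both functions write the same three elements of b. The mapped form simply expresses both sides with three indices, which is the property the TensorMap version relies on.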