SIMD Support
Since version 0.3, Julia has had some vectorisation capabilities that can exploit SIMD instructions when executing loops. It seems to require some nudging, though: there are macros that work a bit like the pragmas in OpenMP.
Example 1: SAXPY
function saxpy(a::Float32, x::Array{Float32,1}, y::Array{Float32,1})
    n = length(x)
    for i = 1:n
        y[i] += a*x[i]
    end
end
Sadly, in version 0.3.10 this obvious candidate does not auto-vectorise.
You can conveniently inspect how this code has compiled in Julia by using the macros @code_llvm (to see the LLVM IR) or @code_native (to see the ASM). The ASM produced is:
Source line: 48
mov R8, QWORD PTR [RSI + 16]
xor R11D, R11D
xor ECX, ECX
cmp RCX, R8
jae 65
cmp RCX, QWORD PTR [RDI + 16]
jae 55
lea R10, QWORD PTR [4*R11]
mov RAX, QWORD PTR [RSI + 8]
sub RAX, R10
mov RDX, QWORD PTR [RDI + 8]
sub RDX, R10
movss XMM1, DWORD PTR [RDX]
mulss XMM1, XMM0
addss XMM1, DWORD PTR [RAX]
movss DWORD PTR [RAX], XMM1
dec R11
inc RCX
cmp R9, RCX
jne -72
pop RBP
ret
movabs RAX, 140636405232096
mov RDI, QWORD PTR [RAX]
movabs RAX, 140636390845696
mov ESI, 48
call RAX
Note the scalar instructions movss, mulss, and addss.
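For reference, the listing above can be reproduced at the REPL along these lines (the input sizes and values here are arbitrary choices of mine):
x = rand(Float32, 1000)
y = rand(Float32, 1000)
a = float32(2.0)
@code_llvm saxpy(a, x, y)     # print the LLVM IR
@code_native saxpy(a, x, y)   # print the generated assembly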
Example 2: SAXPY + SIMD
To make it generate vectorised instructions you have to use the explicit vectorisation macros @simd and @inbounds. @simd gives the compiler license to vectorise without checking the legality of the transformation. @inbounds is an optimisation that turns off subscript checking; subscript checking can throw an exception, and code that may throw isn't vectorisable.
function axpy(a::Float32, x::Array{Float32,1}, y::Array{Float32,1})
    n = length(x)
    @simd for i = 1:n
        @inbounds y[i] += a*x[i]
    end
end
This now compiles with SIMD instructions:
...
Source line: 48
mov R8, QWORD PTR [RDI + 8]
mov R9, QWORD PTR [RSI + 8]
xor EDI, EDI
mov RSI, RAX
and RSI, -8
je 79
pshufd XMM1, XMM0, 0 # xmm1 = xmm0[0,0,0,0]
xor EDI, EDI
lea RCX, QWORD PTR [4*RDI]
mov RDX, R8
sub RDX, RCX
movups XMM3, XMMWORD PTR [RDX]
movups XMM2, XMMWORD PTR [RDX + 16]
mulps XMM3, XMM1
mov RDX, R9
sub RDX, RCX
movups XMM5, XMMWORD PTR [RDX]
movups XMM4, XMMWORD PTR [RDX + 16]
addps XMM5, XMM3
movups XMMWORD PTR [RDX], XMM5
mulps XMM2, XMM1
addps XMM2, XMM4
movups XMMWORD PTR [RDX + 16], XMM2
add RDI, -8
mov RCX, RSI
add RCX, RDI
jne -69
mov RDI, RSI
sub RAX, RDI
je 41
...
Note the instructions movups, addps, mulps… (English: move, unaligned, packed, single-precision). Packed => Vector.
We can also see this in the LLVM IR:
define void @julia_axpy_21664(float, %jl_value_t*, %jl_value_t*) {
...
br label %vector.body
vector.body: ; preds = %vector.body, %vector.ph
%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%25 = getelementptr float* %20, i64 %index, !dbg !4955
%26 = bitcast float* %25 to <4 x float>*
%wide.load = load <4 x float>* %26, align 4
%.sum18 = or i64 %index, 4
%27 = getelementptr float* %20, i64 %.sum18
%28 = bitcast float* %27 to <4 x float>*
%wide.load9 = load <4 x float>* %28, align 4
%29 = getelementptr float* %24, i64 %index, !dbg !4955
%30 = bitcast float* %29 to <4 x float>*
%wide.load10 = load <4 x float>* %30, align 4
%31 = getelementptr float* %24, i64 %.sum18
%32 = bitcast float* %31 to <4 x float>*
%wide.load11 = load <4 x float>* %32, align 4
%33 = fmul <4 x float> %wide.load10, %broadcast.splat13
%34 = fmul <4 x float> %wide.load11, %broadcast.splat13
%35 = fadd <4 x float> %wide.load, %33
%36 = fadd <4 x float> %wide.load9, %34
store <4 x float> %35, <4 x float>* %26, align 4
store <4 x float> %36, <4 x float>* %28, align 4
%index.next = add i64 %index, 8
%37 = icmp eq i64 %index.next, %n.vec
br i1 %37, label %middle.block, label %vector.body
middle.block: ; preds = %vector.body, %if
%resume.val = phi i64 [ 0, %if ], [ %n.vec, %vector.body ]
%cmp.n = icmp eq i64 %15, %resume.val
br i1 %cmp.n, label %L7, label %L
...
Timing a function is easy to do:
x = rand(Float32, 10000000)
y = rand(Float32, 10000000)
a = float32(0.1)
@time axpy(a,x,y)
However, you probably want to run this more than once: the first time you call a function, Julia JIT-compiles it, so that timing won't be representative. Here are the timings of axpy with and without @simd, with length(x) == 10000000:
| @simd? | Time (s)    |
|--------|-------------|
| yes    | 0.006527373 |
| no     | 0.013172804 |
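A minimal warm-up-then-time pattern might look like this (a sketch; the five repetitions are my arbitrary choice):
axpy(a, x, y)             # first call triggers JIT compilation; discard it
for trial = 1:5
    @time axpy(a, x, y)   # later calls time the compiled code
end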
Setup:
- Hardware: Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz (AVX)
- Julia version: 0.3.11
Example 3: SAXPY + Implicit vectorisation
In my experiments, I found that it is sometimes possible to get implicit vectorisation by using just @inbounds to disable bounds checking:
function axpy(a::Float32, x::Array{Float32,1}, y::Array{Float32,1})
    n = length(x)
    for i = 1:n
        @inbounds y[i] += a*x[i]
    end
end
This generates the same ASM as Example 2. Other simple cases have required @simd as well, though, so explicit vectorisation seems to be the only reliable option at the moment.
SIMD Limitations in Julia
- No SIMD functions: function calls have to be inlined. Julia manages to inline short functions itself.
- Code must be type-stable, meaning no implicit type conversion in the loop. Type instability prevents vectorisation and will probably make the code run slowly serially too.
- The @simd macro doesn't have the bells and whistles of the OpenMP simd pragma.
- There doesn't appear to be any way to specify the alignment of memory.
- No outer-loop vectorisation.
- I can't find any diagnostic information about how things were optimised.
- Doesn't handle any branching besides the ifelse function (see the sketch after this list).
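For example, a simple branch in the loop body can sometimes be expressed with ifelse so the loop stays vectorisable. A minimal sketch (clamp_above! is a hypothetical function of mine, not part of the examples above):
function clamp_above!(a::Float32, x::Array{Float32,1})
    @simd for i = 1:length(x)
        @inbounds x[i] = ifelse(x[i] > a, a, x[i])   # branch-free select instead of an if
    end
end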
More Info on Vectorisation
See the Intel page Vectorization in Julia.
Case Study: FitzHugh-Nagumo PDE
Let's experiment with a non-trivial example: a PDE solver for the reaction-diffusion system described by
\[\begin{eqnarray} \frac{\partial u}{\partial t} &=& a \nabla^2 u + u - u^3 - v - k \\\\ \tau\frac{\partial v}{\partial t} &=& b\nabla^2 v + u - v \end{eqnarray}\]
with Neumann boundary conditions on the domain [-1,1]^2, N=100^2 grid points, and T=[0.,10.]. This example is based on this online python example. I implement it in both C++ and Julia and compare the performance and the vectorisation.
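Both implementations discretise the Laplacian with the standard five-point stencil and advance in time with an explicit Euler step; reading off the code below, the update for u is
\[\begin{eqnarray} \nabla^2 u_{i,j} &\approx& \frac{u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1} - 4u_{i,j}}{\Delta x^2} \\\\ u^{n+1}_{i,j} &=& u^n_{i,j} + \Delta t \left(a \nabla^2 u^n_{i,j} + u^n_{i,j} - (u^n_{i,j})^3 - v^n_{i,j} - k\right) \end{eqnarray}\]
and similarly for v, with the time step scaled by 1/τ.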
Here is the code in Julia:
function fitzhugh_nagumo(size, T)
    # Define the constants
    const aa = 2.8e-4
    const bb = 5.0e-3
    const τ = 0.1
    const κ = -0.005
    const dx = 2.0 / size
    const dt = 0.9 * dx^2 / 2.0
    const invdx2 = 1.0 / dx^2
    const dt_τ = dt / τ
    # Random initial fields
    u_old = rand(Float64, (size,size))
    v_old = rand(Float64, (size,size))
    u_new = Array(Float64, (size,size))
    v_new = Array(Float64, (size,size))
    for t = 0.0 : dt : T
        # evolve inner coordinates
        for j = 2:size-1
            for i = 2:size-1
                Δu = invdx2 * (u_old[i+1,j] + u_old[i-1,j] + u_old[i,j+1]
                               + u_old[i,j-1] - 4*u_old[i,j])
                Δv = invdx2 * (v_old[i+1,j] + v_old[i-1,j] + v_old[i,j+1]
                               + v_old[i,j-1] - 4*v_old[i,j])
                u_new[i,j] = u_old[i,j] + dt * (aa*Δu + u_old[i,j] - u_old[i,j]*u_old[i,j]*u_old[i,j]
                                                - v_old[i,j] + κ)
                v_new[i,j] = v_old[i,j] + dt_τ * (bb*Δv + u_old[i,j] - v_old[i,j])
            end
        end
        # fix neumann boundary conditions
        for i = 1:size
            u_new[i,1] = u_new[i,2]
            u_new[i,size] = u_new[i,size-1]
            v_new[i,1] = v_new[i,2]
            v_new[i,size] = v_new[i,size-1]
        end
        for j = 1:size
            u_new[1,j] = u_new[2,j]
            u_new[size,j] = u_new[size-1,j]
            v_new[1,j] = v_new[2,j]
            v_new[size,j] = v_new[size-1,j]
        end
        # swap new and old
        u_new, u_old = u_old, u_new
        v_new, v_old = v_old, v_new
    end
    return (u_old, v_old)
end
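It can be called and timed like so (a sketch; the first call is a warm-up so the timing excludes JIT compilation):
u, v = fitzhugh_nagumo(100, 10.0)   # first call includes JIT compilation
@time fitzhugh_nagumo(100, 10.0)    # second call times the compiled code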
Here is the code in C++:
#include <iostream>
#include <random>
#include <algorithm> // std::swap

#define N 100

std::random_device rd;
std::mt19937 mt(rd());

double fitzhugh_nagumo(double T)
{
    // Define the constants
    const double aa = 2.8e-4;
    const double bb = 5.0e-3;
    const double tau = 0.1;
    const double kappa = -0.005;
    const double dx = 2.0 / N;
    const double dt = 0.9 * dx*dx / 2.0;
    const double invdx2 = 1.0 / (dx*dx);
    const double dt_tau = dt / tau;
    // Random initial fields
    double u_old[N][N];
    double v_old[N][N];
    double u_new[N][N];
    double v_new[N][N];
    std::uniform_real_distribution<double> rng(0., 1.);
    for (int i = 0; i < N; ++i)
    {
        for (int j = 0; j < N; ++j)
        {
            u_old[i][j] = rng(mt);
            v_old[i][j] = rng(mt);
        }
    }
    // Solver
    int Nt = (int) (T / dt);
    std::cout << "Nt = " << Nt << std::endl;
    for (int t = 0; t < Nt; ++t)
    {
        // evolve inner coordinates
        for (int i = 1; i < N-1; ++i)
        {
            for (int j = 1; j < N-1; ++j)
            {
                double delta_u = invdx2 * (u_old[i+1][j] + u_old[i-1][j]
                                           + u_old[i][j+1] + u_old[i][j-1]
                                           - 4*u_old[i][j]);
                double delta_v = invdx2 * (v_old[i+1][j] + v_old[i-1][j]
                                           + v_old[i][j+1] + v_old[i][j-1]
                                           - 4*v_old[i][j]);
                u_new[i][j] = u_old[i][j] + dt * (aa*delta_u + u_old[i][j]
                                                  - u_old[i][j]*u_old[i][j]*u_old[i][j] - v_old[i][j]
                                                  + kappa);
                v_new[i][j] = v_old[i][j] + dt_tau * (bb*delta_v + u_old[i][j] - v_old[i][j]);
            }
        }
        // neumann boundary conditions
        for (int i = 0; i < N; ++i)
        {
            u_new[i][0] = u_new[i][1];
            u_new[i][N-1] = u_new[i][N-2];
            v_new[i][0] = v_new[i][1];
            v_new[i][N-1] = v_new[i][N-2];
        }
        for (int j = 0; j < N; ++j)
        {
            u_new[0][j] = u_new[1][j];
            u_new[N-1][j] = u_new[N-2][j];
            v_new[0][j] = v_new[1][j];
            v_new[N-1][j] = v_new[N-2][j];
        }
        // Swap old and new (std::swap on C arrays swaps element by element)
        std::swap(u_new, u_old);
        std::swap(v_new, v_old);
    }
    return u_old[0][0];
}
Results
Setup:
- Hardware: Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz (AVX)
- Julia version: 0.3.10
- C++ Compiler: g++4.8
| Language | Notes | Time (s) |
|----------|-------|----------|
| Julia | None | 5.993 |
| Julia | @inbounds | 3.105 |
| Julia | @inbounds + @simd | 3.0603 |
| C++ | -O0 -std=c++11 | 22.790 |
| C++ | -Ofast -std=c++11 -fno-tree-vectorize -fno-tree-slp-vectorize | 3.2096 |
| C++ | -Ofast -std=c++11 | 2.142 |
The best time in C++ is only about 1.4x better than the best time in Julia. Inspecting the ASM of Julia + @inbounds + @simd shows that even with these macros Julia is still not generating vector instructions for this kernel. :( If I disable vectorisation in the C++ compiler, the times for Julia and C++ are much closer, which suggests that Julia could get even higher performance if it could generate vector instructions here. I expect newer versions of Julia to improve on this; it will also be worth trying with a newer LLVM version. I find these results very impressive nonetheless.
Notes
The thing holding back performance in the Julia code was the use of u[i,j]^3 instead of u[i,j]*u[i,j]*u[i,j]. The former compiles to a call to pow(double, double) from libm! The latter does what you expect. This is a known bug, fixed when Julia is built against LLVM 3.6; the version I'm using is built against LLVM 3.4.
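A minimal illustration of the workaround (cube_pow and cube_mul are hypothetical names of mine; compare them with @code_native on a build against LLVM 3.4):
cube_pow(u::Float64) = u^3       # on such a build this lowers to a pow(double, double) libm call
cube_mul(u::Float64) = u*u*u     # two multiplications, as expected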