SIMD Support

Since version 0.3, Julia has had some vectorisation capabilities that can exploit SIMD instructions when executing loops. It seems to require some nudging, though: there are macros that work a bit like the pragmas in OpenMP.

Example 1: SAXPY

function saxpy(a::Float32, x::Array{Float32,1}, y::Array{Float32,1})
  n = length(x)
  for i = 1:n
    y[i] += a*x[i]
  end
end

Sadly, in version 0.3.10 this obvious candidate does not auto-vectorise. Julia makes it easy to inspect how code has been compiled: the @code_llvm macro prints the LLVM IR generated for a call, and @code_native prints the ASM.
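
For example, this is roughly how I inspect the generated code for saxpy (the arrays here are just illustrative inputs):

x = rand(Float32, 1000)
y = rand(Float32, 1000)
a = float32(2.0)

@code_llvm saxpy(a, x, y)    # print the LLVM IR for this method instance
@code_native saxpy(a, x, y)  # print the native assembly

The ASM produced is: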

Source line: 48
    mov	R8, QWORD PTR [RSI + 16]
    xor	R11D, R11D
    xor	ECX, ECX
    cmp	RCX, R8
    jae	65
    cmp	RCX, QWORD PTR [RDI + 16]
    jae	55
    lea	R10, QWORD PTR [4*R11]
    mov	RAX, QWORD PTR [RSI + 8]
    sub	RAX, R10
    mov	RDX, QWORD PTR [RDI + 8]
    sub	RDX, R10
    movss	XMM1, DWORD PTR [RDX]
    mulss	XMM1, XMM0
    addss	XMM1, DWORD PTR [RAX]
    movss	DWORD PTR [RAX], XMM1
    dec	R11
    inc	RCX
    cmp	R9, RCX
    jne	-72
    pop	RBP
    ret
    movabs	RAX, 140636405232096
    mov	RDI, QWORD PTR [RAX]
    movabs	RAX, 140636390845696
    mov	ESI, 48
    call	RAX

Note the scalar instructions movss, mulss, and addss.

Example 2: SAXPY + SIMD

To make it generate vectorised instructions you have to use the explicit vectorisation macros @simd and @inbounds. @simd gives the compiler license to vectorise without checking the legality of the transformation. @inbounds turns off bounds checking on the subscripts; a bounds check can throw an exception, and code that might throw isn't vectorisable.

function axpy(a::Float32, x::Array{Float32,1}, y::Array{Float32,1})
  n = length(x)
  @simd for i = 1:n
    @inbounds y[i] += a*x[i]
  end
end

This now compiles with SIMD instructions:

...
Source line: 48
    mov	R8, QWORD PTR [RDI + 8]
    mov	R9, QWORD PTR [RSI + 8]
    xor	EDI, EDI
    mov	RSI, RAX
    and	RSI, -8
    je	79
    pshufd	XMM1, XMM0, 0           # xmm1 = xmm0[0,0,0,0]
    xor	EDI, EDI
    lea	RCX, QWORD PTR [4*RDI]
    mov	RDX, R8
    sub	RDX, RCX
    movups	XMM3, XMMWORD PTR [RDX]
    movups	XMM2, XMMWORD PTR [RDX + 16]
    mulps	XMM3, XMM1
    mov	RDX, R9
    sub	RDX, RCX
    movups	XMM5, XMMWORD PTR [RDX]
    movups	XMM4, XMMWORD PTR [RDX + 16]
    addps	XMM5, XMM3
    movups	XMMWORD PTR [RDX], XMM5
    mulps	XMM2, XMM1
    addps	XMM2, XMM4
    movups	XMMWORD PTR [RDX + 16], XMM2
    add	RDI, -8
    mov	RCX, RSI
    add	RCX, RDI
    jne	-69
    mov	RDI, RSI
    sub	RAX, RDI
    je	41
...

Note the instructions movups, addps, mulps, … (in English: move unaligned packed single, add packed single, multiply packed single). Packed means vector.

We can also see this in the LLVM IR:


define void @julia_axpy_21664(float, %jl_value_t*, %jl_value_t*) {
...
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %25 = getelementptr float* %20, i64 %index, !dbg !4955
  %26 = bitcast float* %25 to <4 x float>*
  %wide.load = load <4 x float>* %26, align 4
  %.sum18 = or i64 %index, 4
  %27 = getelementptr float* %20, i64 %.sum18
  %28 = bitcast float* %27 to <4 x float>*
  %wide.load9 = load <4 x float>* %28, align 4
  %29 = getelementptr float* %24, i64 %index, !dbg !4955
  %30 = bitcast float* %29 to <4 x float>*
  %wide.load10 = load <4 x float>* %30, align 4
  %31 = getelementptr float* %24, i64 %.sum18
  %32 = bitcast float* %31 to <4 x float>*
  %wide.load11 = load <4 x float>* %32, align 4
  %33 = fmul <4 x float> %wide.load10, %broadcast.splat13
  %34 = fmul <4 x float> %wide.load11, %broadcast.splat13
  %35 = fadd <4 x float> %wide.load, %33
  %36 = fadd <4 x float> %wide.load9, %34
  store <4 x float> %35, <4 x float>* %26, align 4
  store <4 x float> %36, <4 x float>* %28, align 4
  %index.next = add i64 %index, 8
  %37 = icmp eq i64 %index.next, %n.vec
  br i1 %37, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body, %if
  %resume.val = phi i64 [ 0, %if ], [ %n.vec, %vector.body ]
  %cmp.n = icmp eq i64 %15, %resume.val
  br i1 %cmp.n, label %L7, label %L

...

Timing a function is easy to do:

x = rand(Float32, 10000000)
y = rand(Float32, 10000000)
a = float32(0.1)

@time axpy(a,x,y)

However, you probably want to run this more than once: the first time you call a function, Julia JIT-compiles it, so that timing won't be representative. Here are the timings of axpy with and without @simd for length(x) == 10000000 (a warm-up pattern is sketched after the table):

@simd?   Time (s)
yes      0.006527373
no       0.013172804
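
For completeness, the pattern I use to avoid timing the JIT compiler is roughly this sketch (array sizes match the table above):

x = rand(Float32, 10000000)
y = rand(Float32, 10000000)
a = float32(0.1)

axpy(a, x, y)        # first call triggers JIT compilation; discard this timing
@time axpy(a, x, y)  # subsequent calls give representative timings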

Setup:

  • Hardware: Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz (AVX)
  • Julia version: 0.3.11

Example 3: SAXPY + Implicit vectorisation

In my experiments, I found that it is sometimes possible to get implicit vectorisation by using just @inbounds to disable bounds checking:

function axpy(a::Float32, x::Array{Float32,1}, y::Array{Float32,1})
  n = length(x)
  for i = 1:n
    @inbounds y[i] += a*x[i]
  end
end

This generates the same ASM as Example 2. Other simple cases have still required @simd, though, so explicit vectorisation seems to be the only reliable option at the moment.

SIMD Limitations in Julia

  • No SIMD functions: function calls inside the loop have to be inlined. Julia can manage to inline short functions itself.
  • Code must be type-stable, i.e. there must be no implicit type conversion in the loop. Type instability prevents vectorisation and will probably make the loop slow even serially (see the sketch after this list).
  • The @simd macro doesn't have the bells and whistles of the OpenMP SIMD pragma.
  • There doesn't appear to be any way to specify the alignment of memory.
  • No outer-loop vectorisation.
  • I can't find any diagnostic information about how things were optimised.
  • It doesn't handle any branching, apart from the ifelse function.
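
To illustrate the type-stability point, here is a hypothetical pair of reductions. In the first, s starts as an Int and becomes a floating-point value after the first iteration, so its type cannot be inferred as a single concrete type and the loop will neither vectorise nor run well serially; in the second, s is Float32 throughout:

function sum_unstable(x::Array{Float32,1})
  s = 0                  # Int here, a floating-point value after the first +=
  @simd for i = 1:length(x)
    @inbounds s += x[i]
  end
  return s
end

function sum_stable(x::Array{Float32,1})
  s = float32(0.0)       # Float32 throughout the loop
  @simd for i = 1:length(x)
    @inbounds s += x[i]
  end
  return s
end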

More Info on Vectorisation

See the Intel page "Vectorization in Julia".

Case Study: FitzHugh-Nagumo PDE

As an experiment with a non-trivial example, here is a PDE solver for the reaction-diffusion system described by:

\[\begin{eqnarray} \frac{\partial u}{\partial t} &=& a \nabla^2 u + u - u^3 - v - k \\ \tau\frac{\partial v}{\partial t} &=& b\nabla^2 v + u - v \end{eqnarray}\]

With Neumann boundary conditions, a domain of $[-1,1]^2$, N = 100^2 grid points, and $t \in [0, 10]$. This example is based on this online Python example. I implement it in both C++ and Julia and compare the performance and the vectorisation.
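
For reference, the Laplacians are discretised with the standard five-point stencil on a uniform grid of spacing dx, which is exactly what appears in the inner loops of both implementations below, and time is advanced with a forward Euler step of size dt:

\[ \nabla^2 u_{i,j} \approx \frac{u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1} - 4u_{i,j}}{dx^2} \]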

Here is the code in Julia:

function fitzhugh_nagumo(size, T)
  # Define the constants
  const aa = 2.8e-4
  const bb = 5.0e-3
  const τ = 0.1
  const κ = -0.005
  const dx = 2.0 / size
  const dt = 0.9 * dx^2 / 2.0
  const invdx2 = 1.0 / dx^2
  const dt_τ = dt / τ

  # Random initial fields
  u_old = rand(Float64, (size,size))
  v_old = rand(Float64, (size,size))

  u_new = Array(Float64, (size,size))
  v_new = Array(Float64, (size,size))
  
  for t = 0.0 : dt : T
    for j = 2:size-1
      for i = 2:size-1
        Δu = invdx2 * (u_old[i+1,j] + u_old[i-1,j] + u_old[i,j+1]
                        + u_old[i,j-1] - 4*u_old[i,j])
        Δv = invdx2 * (v_old[i+1,j] + v_old[i-1,j] + v_old[i,j+1]
                        + v_old[i,j-1] - 4*v_old[i,j])

        u_new[i,j] = u_old[i,j] + dt * (aa*Δu + u_old[i,j] - u_old[i,j]*u_old[i,j]*u_old[i,j] 
                                        - v_old[i,j] + κ)
        v_new[i,j] = v_old[i,j] + dt_τ * (bb*Δv + u_old[i,j] - v_old[i,j])
      end
    end

    for i = 1:size
      u_new[i,1] = u_new[i,2]
      u_new[i,size] = u_new[i,size-1]
      v_new[i,1] = v_new[i,2]
      v_new[i,size] = v_new[i,size-1]
    end
    
    for j = 1:size
      u_new[1,j] = u_new[2,j]
      u_new[size,j] = u_new[size-1,j]
      v_new[1,j] = v_new[2,j]
      v_new[size,j] = v_new[size-1,j]
    end
      
    # swap new and old
    u_new, u_old = u_old, u_new
    v_new, v_old = v_old, v_new
  end
  
  return (u_old, v_old)
end
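
A hypothetical driver matching the benchmark setup below (N = 100, T = 10); the first call is a warm-up so that JIT compilation is not included in the timing:

fitzhugh_nagumo(100, 10.0)        # warm-up: JIT-compile
@time fitzhugh_nagumo(100, 10.0)  # timed run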

Here is the code in C++.

#include <iostream>
#include <random>
#include <algorithm> // swap

#define N 100

std::random_device rd;
std::mt19937 mt(rd());

double fitzhugh_nagumo(double T)
{
  // Define the constants
  const double aa = 2.8e-4;
  const double bb = 5.0e-3;
  const double tau = 0.1;
  const double kappa = -0.005;
  const double dx = 2.0 / N;
  const double dt = 0.9 * dx*dx / 2.0;
  const double invdx2 = 1.0 / (dx*dx);
  const double dt_tau = dt / tau;

  // Random initial fields
  double u_old[N][N];
  double v_old[N][N];
  double u_new[N][N];
  double v_new[N][N];

  std::uniform_real_distribution<double> rng(0.,1.);

  for (int i = 0; i < N; ++i) {
    for (int j = 0; j < N; ++j) {
      u_old[i][j] = rng(mt);
      v_old[i][j] = rng(mt);
    }
  }

  // Solver
  int Nt = (int) (T / dt);
  std::cout << "Nt = " << Nt << std::endl;
  for (int t = 0; t < Nt; ++t) {
    // evolve inner coordinates
    for (int i = 1; i < N-1; ++i) {
      for (int j = 1; j < N-1; ++j) {
        double delta_u = invdx2 * (u_old[i+1][j] + u_old[i-1][j] 
                                   + u_old[i][j+1] + u_old[i][j-1] 
                                   - 4*u_old[i][j]);
        double delta_v = invdx2 * (v_old[i+1][j] + v_old[i-1][j]
                                   + v_old[i][j+1] + v_old[i][j-1] 
                                   - 4*v_old[i][j]);

        u_new[i][j] = u_old[i][j] + dt * (aa*delta_u + u_old[i][j]
                                          - u_old[i][j]*u_old[i][j]*u_old[i][j]
                                          - v_old[i][j] + kappa);
        v_new[i][j] = v_old[i][j] + dt_tau * (bb*delta_v + u_old[i][j] - v_old[i][j]);
      }
    }

    // neumann boundary conditions
    for (int i = 0; i < N; ++i) {
      u_new[i][0] = u_new[i][1];
      u_new[i][N-1] = u_new[i][N-2];
      v_new[i][0] = v_new[i][1];
      v_new[i][N-1] = v_new[i][N-2];
    }

    for (int j = 0; j < N; ++j) {
      u_new[0][j] = u_new[1][j];
      u_new[N-1][j] = u_new[N-2][j];
      v_new[0][j] = v_new[1][j];
      v_new[N-1][j] = v_new[N-2][j];
    }

    // Swap old and new
    std::swap(u_new, u_old);
    std::swap(v_new, v_old);
  }

  return u_old[0][0];
}

Results

Setup:

  • Hardware: Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz (AVX)
  • Julia version: 0.3.10
  • C++ Compiler: g++4.8

Language   Notes                                                            Time (s)
Julia      None                                                             5.993
Julia      @inbounds                                                        3.105
Julia      @inbounds + @simd                                                3.0603
C++        -O0 -std=c++11                                                   22.790
C++        -Ofast -std=c++11 -fno-tree-vectorize -fno-tree-slp-vectorize    3.2096
C++        -Ofast -std=c++11                                                2.142

The best C++ time is only about 1.4x faster than the best Julia time. Inspecting the ASM of the Julia version with @inbounds + @simd shows that even with these macros Julia is still not generating vector instructions for this code. :( If I disable vectorisation in the C++ compiler, the Julia and C++ times are much closer, which suggests that Julia could get even higher performance if it could vectorise this loop nest. I expect newer versions of Julia will improve this. I find these results very impressive nonetheless; it will be worth trying again with a newer LLVM version too.

Notes

The thing that was holding back performance in the Julia code was using u[i,j]^3 instead of u[i,j]*u[i,j]*u[i,j]. The former compiles to a call to pow(double, double) from libm! The latter does what you expect. This is a known bug that is fixed when Julia is built against LLVM 3.6; the version I'm using is built against LLVM 3.4.
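
You can check this for yourself with a hypothetical pair of one-liners; on my build, the first shows a call out to pow while the second compiles to plain multiplies:

cube_pow(x::Float64) = x^3
cube_mul(x::Float64) = x*x*x

@code_native cube_pow(1.5)  # contains a call to pow(double, double)
@code_native cube_mul(1.5)  # just a couple of multiply instructions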