SIMD Support

Since version 0.3 Julia has some vectorisation capabilities that can exploit SIMD instructions when executing loops. It seems to require some nudging though. There are macros that are a bit like the pragmas in OpenMP.

Example 1. SAXPY

function saxpy(a::Float32, x::Array{Float32,1}, y::Array{Float32,1})
  n = length(x)
  for i = 1:n
    y[i] += a*x[i]
  end
end

Sadly, in version 0.3.10 this obvious candidate does not auto-vectorise. You can inspect how this code has compiled nicely in Julia by using the macros @code_llvm to see the LLVM IR or @code_native to see the ASM. The ASM produced is:

Source line: 48
	mov	R8, QWORD PTR [RSI + 16]
	xor	R11D, R11D
	xor	ECX, ECX
	cmp	RCX, R8
	jae	65
	cmp	RCX, QWORD PTR [RDI + 16]
	jae	55
	lea	R10, QWORD PTR [4*R11]
	mov	RAX, QWORD PTR [RSI + 8]
	sub	RAX, R10
	mov	RDX, QWORD PTR [RDI + 8]
	sub	RDX, R10
	movss	XMM1, DWORD PTR [RDX]
	mulss	XMM1, XMM0
	addss	XMM1, DWORD PTR [RAX]
	movss	DWORD PTR [RAX], XMM1
	dec	R11
	inc	RCX
	cmp	R9, RCX
	jne	-72
	pop	RBP
	ret
	movabs	RAX, 140636405232096
	mov	RDI, QWORD PTR [RAX]
	movabs	RAX, 140636390845696
	mov	ESI, 48
	call	RAX

Note the scalar instructions movss, mulss, and addss.

Example 2: SAXPY + SIMD

To make it generate vectorised instructions you have to use the explicit vectorisation macros @simd and @inbounds macros. @simd gives the compiler license to vectorise without checking the legality of the transformation. @inbounds is an optimisation that turns off subscript checking, because subscript checking might throw an exception and so isn’t vectorisable.

function axpy(a::Float32, x::Array{Float32,1}, y::Array{Float32,1})
  n = length(x)
  @simd for i = 1:n
    @inbounds y[i] += a*x[i]
  end
end

This now compiles with SIMD instructions:

...
Source line: 48
	mov	R8, QWORD PTR [RDI + 8]
	mov	R9, QWORD PTR [RSI + 8]
	xor	EDI, EDI
	mov	RSI, RAX
	and	RSI, -8
	je	79
	pshufd	XMM1, XMM0, 0           # xmm1 = xmm0[0,0,0,0]
	xor	EDI, EDI
	lea	RCX, QWORD PTR [4*RDI]
	mov	RDX, R8
	sub	RDX, RCX
	movups	XMM3, XMMWORD PTR [RDX]
	movups	XMM2, XMMWORD PTR [RDX + 16]
	mulps	XMM3, XMM1
	mov	RDX, R9
	sub	RDX, RCX
	movups	XMM5, XMMWORD PTR [RDX]
	movups	XMM4, XMMWORD PTR [RDX + 16]
	addps	XMM5, XMM3
	movups	XMMWORD PTR [RDX], XMM5
	mulps	XMM2, XMM1
	addps	XMM2, XMM4
	movups	XMMWORD PTR [RDX + 16], XMM2
	add	RDI, -8
	mov	RCX, RSI
	add	RCX, RDI
	jne	-69
	mov	RDI, RSI
	sub	RAX, RDI
	je	41
...

Note the instructions movups, addps, mulps… (English: move, unaligned, packed, single-precision). Packed => Vector.

We can also see from the LLVM IR:


define void @julia_axpy_21664(float, %jl_value_t*, %jl_value_t*) {
...
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %25 = getelementptr float* %20, i64 %index, !dbg !4955
  %26 = bitcast float* %25 to <4 x float>*
  %wide.load = load <4 x float>* %26, align 4
  %.sum18 = or i64 %index, 4
  %27 = getelementptr float* %20, i64 %.sum18
  %28 = bitcast float* %27 to <4 x float>*
  %wide.load9 = load <4 x float>* %28, align 4
  %29 = getelementptr float* %24, i64 %index, !dbg !4955
  %30 = bitcast float* %29 to <4 x float>*
  %wide.load10 = load <4 x float>* %30, align 4
  %31 = getelementptr float* %24, i64 %.sum18
  %32 = bitcast float* %31 to <4 x float>*
  %wide.load11 = load <4 x float>* %32, align 4
  %33 = fmul <4 x float> %wide.load10, %broadcast.splat13
  %34 = fmul <4 x float> %wide.load11, %broadcast.splat13
  %35 = fadd <4 x float> %wide.load, %33
  %36 = fadd <4 x float> %wide.load9, %34
  store <4 x float> %35, <4 x float>* %26, align 4
  store <4 x float> %36, <4 x float>* %28, align 4
  %index.next = add i64 %index, 8
  %37 = icmp eq i64 %index.next, %n.vec
  br i1 %37, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body, %if
  %resume.val = phi i64 [ 0, %if ], [ %n.vec, %vector.body ]
  %cmp.n = icmp eq i64 %15, %resume.val
  br i1 %cmp.n, label %L7, label %L

...

Timing of a function is easy to do:

x = rand(Float32, 10000000)
y = rand(Float32, 10000000)
a = float32(0.1)

@time axpy(a,x,y)

However, you probably want to run this more than once since the first time you call a function, Julia JITs it so the timing won’t be representative. Here are the timings of axpy with and without @simd with length(x) == 10000000:

@simd? Time (s)
yes 0.006527373
no 0.013172804

Setup: Hardware: Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz (AVX) Julia version: 0.3.11

Example 3: SAXPY + Implicit vectorisation

In my experiments, I found that it is sometimes possible to get implicit vectorisation by using just @inbounds to disable bounds checking:

function axpy(a::Float32, x::Array{Float32,1}, y::Array{Float32,1})
  n = length(x)
  for i = 1:n
    @inbounds y[i] += a*x[i]
  end
end

This generates the same ASM as Example 2. Other simple cases have required @simd as well, so explicit vectorisation seems to be the only option at the moment.

SIMD Limitations in Julia

  • No SIMD functions. Function calls have to be inlined. Julia can manage to inline short functions itself.
  • Code must be type-stable. This means that there isn’t implicit type conversion in the loop. This will prevent vectorisation and probably make it run slow serially too.
  • SIMD macro doesn’t have the bells and whistles of the OpenMP SIMD pragma.
  • Doesn’t appear to be any way to specify alignment of memory.
  • No outer loop vectorisation.
  • I can’t find any diagnostic information about how things were optimised.
  • Does handle any branching besides some ifelse function.

More Info on Vectorisation

See Intel page Vectorization in Julia.

Case Study: Fitzhugh-Nagamo PDE

Experiment with a non-trivial example of a PDE solver using the reaction-diffusion system described by:

\[\begin{eqnarray} \frac{\partial u}{\partial t} &=& a \nabla^2 u + u - u^3 - v - k \\\\ \tau\frac{\partial v}{\partial t} &=& b\nabla^2 v + u - v \end{eqnarray}\]

With Neumann boundary conditions, boundary of [1,1]^2, N=100^2, and T=[0.,10.]. This example is based on this online python example. I implement this in C++ and Julia and compare the performance and the vectorisation.

Here is the code in Julia:

function fitzhugh_nagumo(size, T)
  # Define the constants
  const aa = 2.8e-4
  const bb = 5.0e-3
  const τ = 0.1
  const κ = -0.005
  const dx = 2.0 / size
  const dt = 0.9 * dx^2 / 2.0
  const invdx2 = 1.0 / dx^2
  const dt_τ = dt / τ

  # Random initial fields
  u_old = rand(Float64, (size,size))
  v_old = rand(Float64, (size,size))

  u_new = Array(Float64, (size,size))
  v_new = Array(Float64, (size,size))
  
  for t = 0.0 : dt : T
    for j = 2:size-1
      for i = 2:size-1
        Δu = invdx2 * (u_old[i+1,j] + u_old[i-1,j] + u_old[i,j+1]
        				+ u_old[i,j-1] - 4*u_old[i,j])
        Δv = invdx2 * (v_old[i+1,j] + v_old[i-1,j] + v_old[i,j+1]
        				+ v_old[i,j-1] - 4*v_old[i,j])

        u_new[i,j] = u_old[i,j] + dt * (aa*Δu + u_old[i,j] - u_old[i,j]*u_old[i,j]*u_old[i,j] 
        								- v_old[i,j] + κ)
        v_new[i,j] = v_old[i,j] + dt_τ * (bb*Δv + u_old[i,j] - v_old[i,j])
      end
    end

    #print_vars(u_new[1,2], v_new[1,2], 0, 0, t)
    
    # fix neumann boundary conditions
    for i = 1:size
      u_new[i,1] = u_new[i,2]
      u_new[i,size] = u_new[i,size-1]
      v_new[i,1] = v_new[i,2]
      v_new[i,size] = v_new[i,size-1]
    end
    
    for j = 1:size
      u_new[1,j] = u_new[2,j]
      u_new[size,j] = u_new[size-1,j]
      v_new[1,j] = v_new[2,j]
      v_new[size,j] = v_new[size-1,j]
    end
      
    # swap new and old
    u_new, u_old = u_old, u_new
    v_new, v_old = v_old, v_new
  end
  
  return (u_old, v_old)
end

Here is the code in C++:

#include <iostream>
#include <random>
#include <algorithm> // swap

#define N 100

std::random_device rd;
std::mt19937 mt(rd());

double fitzhugh_nagumo(double T)
{
	// Define the constants
	const double aa = 2.8e-4;
	const double bb = 5.0e-3;
	const double tau = 0.1;
	const double kappa = -0.005;
	const double dx = 2.0 / N;
	const double dt = 0.9 * dx*dx / 2.0;
	const double invdx2 = 1.0 / (dx*dx);
	const double dt_tau = dt / tau;

	// Random initial fields
	double u_old[N][N];
	double v_old[N][N];
	double u_new[N][N];
	double v_new[N][N];

	std::uniform_real_distribution<double> rng(0.,1.);

	for (int i = 0; i < N; ++i)
	{
		for (int j = 0; j < N; ++j)
		{
			u_old[i][j] = rng(mt);
			v_old[i][j] = rng(mt);
		}
	}

	// Solver
	int Nt = (int) (T / dt);
	std::cout << "Nt = " << Nt << std::endl;
	for (int t = 0; t < Nt; ++t)
	{
		// evolve inner coordinates
		for (int i = 1; i < N-1; ++i)
		{
			for (int j = 1; j < N-1; ++j)
			{
				double delta_u = invdx2 * (u_old[i+1][j] + u_old[i-1][j] 
											+ u_old[i][j+1] + u_old[i][j-1] 
											- 4*u_old[i][j]);
				double delta_v = invdx2 * (v_old[i+1][j] + v_old[i-1][j]
											+ v_old[i][j+1] + v_old[i][j-1] 
											- 4*v_old[i][j]);

				u_new[i][j] = u_old[i][j] + dt * (aa*delta_u + u_old[i][j]
													- u_old[i][j]*u_old[i][j]*u_old[i][j] - v_old[i][j] 
													+ kappa);
				v_new[i]After i[j] = v_old[i][j] + dt_tau * (bb * delta_v + u_old[i][j] - v_old[i][j]);
			}
		}

		// neumann boundary conditions
		for (int i = 0; i < N; ++i)
		{
			u_new[i][0] = u_new[i][1];
			u_new[i][N-1] = u_new[i][N-2];
			v_new[i][0] = v_new[i][1];
			v_new[i][N-1] = v_new[i][N-2];
		}

		for (int j = 0; j < N; ++j)
		{
			u_new[0][j] = u_new[1][j];
			u_new[N-1][j] = u_new[N-2][j];
			v_new[0][j] = v_new[1][j];
			v_new[N-1][j] = v_new[N-2][j];
		}

		// Swap old and new
		std::swap(u_new, u_old);
		std::swap(v_new, v_old);
	}

	return u_old[0][0];
}

Results

Setup:

  • Hardware: Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz (AVX)
  • Julia version: 0.3.10
  • C++ Compiler: g++4.8
Language Notes Time (s)
Julia None 5.993
Julia @inbounds 3.105
Julia @inbounds + @simd 3.0603
C++ -O0 -std=c++11 22.790
C++ -Ofast -std=c++11 -fno-tree-vectorize -fno-tree-slp-vectorize 3.2096
C++ -Ofast -std=c++11 2.142

The best time in C++ is only 1.4x better than the best time in Julia. Inspecting the ASM of Julia + @inbounds + @simd shows that even with these macros Julia is still not generating vector instructions. :(. If I disable vectorisation in the C++ compiler, the times between Julia and C++ are much closer. This suggests that Julia could get even higher performance if it could generate vector instructions. I suppose that newer versions of Julia will improve this in the future. I find these results very impressive nonetheless. It will be worth trying with a newer LLVM version too.

Notes

The thing holding back performance in the Julia code was the use of u[i,j]^3 instead of u[i,j]*u[i,j]*u[i,j]. The former version compiles to a call of pow(double, double) from libm! The latter does what you expect. This is a known bug. Fixed when Julia is built against LLVM 3.6. Version I’m using is built with LLVM v3.4.