go-seq | James Briggs' Blog

SIMD Support

Since version 0.3 Julia has some vectorisation capabilities that can exploit SIMD instructions when executing loops. It seems to require some nudging though. There are macros that are a bit like the pragmas in OpenMP.

Example 1. SAXPY

function saxpy(a::Float32, x::Array{Float32,1}, y::Array{Float32,1})
  n = length(x)
  for i = 1:n
    y[i] += a*x[i]
  end
end

Sadly, in version 0.3.10 this obvious candidate does not auto-vectorise. You can inspect how this code has compiled nicely in Julia by using the macros @code_llvm to see the LLVM IR or @code_native to see the ASM. The ASM produced is:

Source line: 48
	mov	R8, QWORD PTR [RSI + 16]
	xor	R11D, R11D
	xor	ECX, ECX
	cmp	RCX, R8
	jae	65
	cmp	RCX, QWORD PTR [RDI + 16]
	jae	55
	lea	R10, QWORD PTR [4*R11]
	mov	RAX, QWORD PTR [RSI + 8]
	sub	RAX, R10
	mov	RDX, QWORD PTR [RDI + 8]
	sub	RDX, R10
	movss	XMM1, DWORD PTR [RDX]
	mulss	XMM1, XMM0
	addss	XMM1, DWORD PTR [RAX]
	movss	DWORD PTR [RAX], XMM1
	dec	R11
	inc	RCX
	cmp	R9, RCX
	jne	-72
	pop	RBP
	ret
	movabs	RAX, 140636405232096
	mov	RDI, QWORD PTR [RAX]
	movabs	RAX, 140636390845696
	mov	ESI, 48
	call	RAX

Note the scalar instructions movss, mulss, and addss.

Example 2: SAXPY + SIMD

To make it generate vectorised instructions you have to use the explicit vectorisation macros @simd and @inbounds macros. @simd gives the compiler license to vectorise without checking the legality of the transformation. @inbounds is an optimisation that turns off subscript checking, because subscript checking might throw an exception and so isn’t vectorisable.

function axpy(a::Float32, x::Array{Float32,1}, y::Array{Float32,1})
  n = length(x)
  @simd for i = 1:n
    @inbounds y[i] += a*x[i]
  end
end

This now compiles with SIMD instructions:

...
Source line: 48
	mov	R8, QWORD PTR [RDI + 8]
	mov	R9, QWORD PTR [RSI + 8]
	xor	EDI, EDI
	mov	RSI, RAX
	and	RSI, -8
	je	79
	pshufd	XMM1, XMM0, 0           # xmm1 = xmm0[0,0,0,0]
	xor	EDI, EDI
	lea	RCX, QWORD PTR [4*RDI]
	mov	RDX, R8
	sub	RDX, RCX
	movups	XMM3, XMMWORD PTR [RDX]
	movups	XMM2, XMMWORD PTR [RDX + 16]
	mulps	XMM3, XMM1
	mov	RDX, R9
	sub	RDX, RCX
	movups	XMM5, XMMWORD PTR [RDX]
	movups	XMM4, XMMWORD PTR [RDX + 16]
	addps	XMM5, XMM3
	movups	XMMWORD PTR [RDX], XMM5
	mulps	XMM2, XMM1
	addps	XMM2, XMM4
	movups	XMMWORD PTR [RDX + 16], XMM2
	add	RDI, -8
	mov	RCX, RSI
	add	RCX, RDI
	jne	-69
	mov	RDI, RSI
	sub	RAX, RDI
	je	41
...

Note the instructions movups, addps, mulps… (English: move, unaligned, packed, single-precision). Packed => Vector.

We can also see from the LLVM IR:

define void @julia_axpy_21664(float, %jl_value_t*, %jl_value_t*) {
...
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %25 = getelementptr float* %20, i64 %index, !dbg !4955
  %26 = bitcast float* %25 to <4 x float>*
  %wide.load = load <4 x float>* %26, align 4
  %.sum18 = or i64 %index, 4
  %27 = getelementptr float* %20, i64 %.sum18
  %28 = bitcast float* %27 to <4 x float>*
  %wide.load9 = load <4 x float>* %28, align 4
  %29 = getelementptr float* %24, i64 %index, !dbg !4955
  %30 = bitcast float* %29 to <4 x float>*
  %wide.load10 = load <4 x float>* %30, align 4
  %31 = getelementptr float* %24, i64 %.sum18
  %32 = bitcast float* %31 to <4 x float>*
  %wide.load11 = load <4 x float>* %32, align 4
  %33 = fmul <4 x float> %wide.load10, %broadcast.splat13
  %34 = fmul <4 x float> %wide.load11, %broadcast.splat13
  %35 = fadd <4 x float> %wide.load, %33
  %36 = fadd <4 x float> %wide.load9, %34
  store <4 x float> %35, <4 x float>* %26, align 4
  store <4 x float> %36, <4 x float>* %28, align 4
  %index.next = add i64 %index, 8
  %37 = icmp eq i64 %index.next, %n.vec
  br i1 %37, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body, %if
  %resume.val = phi i64 [ 0, %if ], [ %n.vec, %vector.body ]
  %cmp.n = icmp eq i64 %15, %resume.val
  br i1 %cmp.n, label %L7, label %L

...

Timing of a function is easy to do:

x = rand(Float32, 10000000)
y = rand(Float32, 10000000)
a = float32(0.1)

@time axpy(a,x,y)

However, you probably want to run this more than once since the first time you call a function, Julia JITs it so the timing won’t be representative. Here are the timings of axpy with and without @simd with length(x) == 10000000:

@simd?	Time (s)
yes	0.006527373
no	0.013172804

Setup: Hardware: Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz (AVX) Julia version: 0.3.11

Example 3: SAXPY + Implicit vectorisation

In my experiments, I found that it is sometimes possible to get implicit vectorisation by using just @inbounds to disable bounds checking:

function axpy(a::Float32, x::Array{Float32,1}, y::Array{Float32,1})
  n = length(x)
  for i = 1:n
    @inbounds y[i] += a*x[i]
  end
end

This generates the same ASM as Example 2. Other simple cases have required @simd as well, so explicit vectorisation seems to be the only option at the moment.

SIMD Limitations in Julia

No SIMD functions. Function calls have to be inlined. Julia can manage to inline short functions itself.
Code must be type-stable. This means that there isn’t implicit type conversion in the loop. This will prevent vectorisation and probably make it run slow serially too.
SIMD macro doesn’t have the bells and whistles of the OpenMP SIMD pragma.
Doesn’t appear to be any way to specify alignment of memory.
No outer loop vectorisation.
I can’t find any diagnostic information about how things were optimised.
Does handle any branching besides some ifelse function.

More Info on Vectorisation

See Intel page Vectorization in Julia.

Case Study: Fitzhugh-Nagamo PDE

Experiment with a non-trivial example of a PDE solver using the reaction-diffusion system described by:

\[\begin{eqnarray} \frac{\partial u}{\partial t} &=& a \nabla^2 u + u - u^3 - v - k \\\\ \tau\frac{\partial v}{\partial t} &=& b\nabla^2 v + u - v \end{eqnarray}\]

With Neumann boundary conditions, boundary of [1,1]^2, N=100^2, and T=[0.,10.]. This example is based on this online python example. I implement this in C++ and Julia and compare the performance and the vectorisation.

Here is the code in Julia:

function fitzhugh_nagumo(size, T)
  # Define the constants
  const aa = 2.8e-4
  const bb = 5.0e-3
  const τ = 0.1
  const κ = -0.005
  const dx = 2.0 / size
  const dt = 0.9 * dx^2 / 2.0
  const invdx2 = 1.0 / dx^2
  const dt_τ = dt / τ

  # Random initial fields
  u_old = rand(Float64, (size,size))
  v_old = rand(Float64, (size,size))

  u_new = Array(Float64, (size,size))
  v_new = Array(Float64, (size,size))
  
  for t = 0.0 : dt : T
    for j = 2:size-1
      for i = 2:size-1
        Δu = invdx2 * (u_old[i+1,j] + u_old[i-1,j] + u_old[i,j+1]
        				+ u_old[i,j-1] - 4*u_old[i,j])
        Δv = invdx2 * (v_old[i+1,j] + v_old[i-1,j] + v_old[i,j+1]
        				+ v_old[i,j-1] - 4*v_old[i,j])

        u_new[i,j] = u_old[i,j] + dt * (aa*Δu + u_old[i,j] - u_old[i,j]*u_old[i,j]*u_old[i,j] 
        								- v_old[i,j] + κ)
        v_new[i,j] = v_old[i,j] + dt_τ * (bb*Δv + u_old[i,j] - v_old[i,j])
      end
    end

    #print_vars(u_new[1,2], v_new[1,2], 0, 0, t)
    
    # fix neumann boundary conditions
    for i = 1:size
      u_new[i,1] = u_new[i,2]
      u_new[i,size] = u_new[i,size-1]
      v_new[i,1] = v_new[i,2]
      v_new[i,size] = v_new[i,size-1]
    end
    
    for j = 1:size
      u_new[1,j] = u_new[2,j]
      u_new[size,j] = u_new[size-1,j]
      v_new[1,j] = v_new[2,j]
      v_new[size,j] = v_new[size-1,j]
    end
      
    # swap new and old
    u_new, u_old = u_old, u_new
    v_new, v_old = v_old, v_new
  end
  
  return (u_old, v_old)
end

Here is the code in C++:

#include <iostream>
#include <random>
#include <algorithm> // swap

#define N 100

std::random_device rd;
std::mt19937 mt(rd());

double fitzhugh_nagumo(double T)
{
	// Define the constants
	const double aa = 2.8e-4;
	const double bb = 5.0e-3;
	const double tau = 0.1;
	const double kappa = -0.005;
	const double dx = 2.0 / N;
	const double dt = 0.9 * dx*dx / 2.0;
	const double invdx2 = 1.0 / (dx*dx);
	const double dt_tau = dt / tau;

	// Random initial fields
	double u_old[N][N];
	double v_old[N][N];
	double u_new[N][N];
	double v_new[N][N];

	std::uniform_real_distribution<double> rng(0.,1.);

	for (int i = 0; i < N; ++i)
	{
		for (int j = 0; j < N; ++j)
		{
			u_old[i][j] = rng(mt);
			v_old[i][j] = rng(mt);
		}
	}

	// Solver
	int Nt = (int) (T / dt);
	std::cout << "Nt = " << Nt << std::endl;
	for (int t = 0; t < Nt; ++t)
	{
		// evolve inner coordinates
		for (int i = 1; i < N-1; ++i)
		{
			for (int j = 1; j < N-1; ++j)
			{
				double delta_u = invdx2 * (u_old[i+1][j] + u_old[i-1][j] 
											+ u_old[i][j+1] + u_old[i][j-1] 
											- 4*u_old[i][j]);
				double delta_v = invdx2 * (v_old[i+1][j] + v_old[i-1][j]
											+ v_old[i][j+1] + v_old[i][j-1] 
											- 4*v_old[i][j]);

				u_new[i][j] = u_old[i][j] + dt * (aa*delta_u + u_old[i][j]
													- u_old[i][j]*u_old[i][j]*u_old[i][j] - v_old[i][j] 
													+ kappa);
				v_new[i]After i[j] = v_old[i][j] + dt_tau * (bb * delta_v + u_old[i][j] - v_old[i][j]);
			}
		}

		// neumann boundary conditions
		for (int i = 0; i < N; ++i)
		{
			u_new[i][0] = u_new[i][1];
			u_new[i][N-1] = u_new[i][N-2];
			v_new[i][0] = v_new[i][1];
			v_new[i][N-1] = v_new[i][N-2];
		}

		for (int j = 0; j < N; ++j)
		{
			u_new[0][j] = u_new[1][j];
			u_new[N-1][j] = u_new[N-2][j];
			v_new[0][j] = v_new[1][j];
			v_new[N-1][j] = v_new[N-2][j];
		}

		// Swap old and new
		std::swap(u_new, u_old);
		std::swap(v_new, v_old);
	}

	return u_old[0][0];
}

Results

Setup:

Hardware: Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz (AVX)
Julia version: 0.3.10
C++ Compiler: g++4.8

Language	Notes	Time (s)
Julia	None	5.993
Julia	@inbounds	3.105
Julia	@inbounds + @simd	3.0603
C++	-O0 -std=c++11	22.790
C++	-Ofast -std=c++11 -fno-tree-vectorize -fno-tree-slp-vectorize	3.2096
C++	-Ofast -std=c++11	2.142

The best time in C++ is only 1.4x better than the best time in Julia. Inspecting the ASM of Julia + @inbounds + @simd shows that even with these macros Julia is still not generating vector instructions. :(. If I disable vectorisation in the C++ compiler, the times between Julia and C++ are much closer. This suggests that Julia could get even higher performance if it could generate vector instructions. I suppose that newer versions of Julia will improve this in the future. I find these results very impressive nonetheless. It will be worth trying with a newer LLVM version too.

Notes

The thing holding back performance in the Julia code was the use of u[i,j]^3 instead of u[i,j]*u[i,j]*u[i,j]. The former version compiles to a call of pow(double, double) from libm! The latter does what you expect. This is a known bug. Fixed when Julia is built against LLVM 3.6. Version I’m using is built with LLVM v3.4.

Experiments with Julia