I’m experimenting with SSE SIMD instructions to speed up graphics calculations. I wrote some test code to measure the speedup. The code allocates two float arrays and adds the second array to the first. The trial is repeated many times with the same blocks of memory, so allocation isn’t a factor. To keep things fair, the non-SIMD loop still operates in blocks of 4 so its loop overhead is equal.
The vanilla floating-point loop looks like this:
float *a1, *a2; //allocated elsewhere
float * const end = a1 + size;
float *a = a1;
float *b = a2;
for (; a != end; )
{
*a++ += *b++;
*a++ += *b++;
*a++ += *b++;
*a++ += *b++;
}
The SSE loop looks like this:
// same setup as before
for (; a != end; a += 4, b += 4)
{
__m128 ax = _mm_load_ps(a);
__m128 bx = _mm_load_ps(b);
ax = _mm_add_ps(ax, bx);
_mm_store_ps(a, ax);
}
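One caveat worth noting: `_mm_load_ps` and `_mm_store_ps` require 16-byte-aligned pointers (which is why the full program below uses `_aligned_malloc`), and will fault on unaligned addresses. For data you can't align there are `_mm_loadu_ps`/`_mm_storeu_ps` variants. A minimal sketch, with a hypothetical helper name, still assuming the size is a multiple of 4:

```cpp
#include <xmmintrin.h>

// Hypothetical helper: a += b over `size` floats when the pointers may not
// be 16-byte aligned. The loadu/storeu intrinsics tolerate any address,
// historically at a small speed penalty. Assumes size is a multiple of 4.
void add_unaligned(float *a, const float *b, long size)
{
    for (const float *end = a + size; a != end; a += 4, b += 4)
    {
        __m128 ax = _mm_loadu_ps(a);
        __m128 bx = _mm_loadu_ps(b);
        _mm_storeu_ps(a, _mm_add_ps(ax, bx));
    }
}
```

On aligned data the aligned intrinsics are the better choice, so the benchmark keeps them.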
Simple, right? The full source code, including memory allocation and timing, is at the bottom of the post.
Each SSE instruction processes 4 floats at a time, so the expected speedup is 4x. I ran tests on arrays from 12 to 10,000,000 elements. Here are the results:
Small arrays still benefit, but there's a peak around 1000 elements. After that, the speedup deteriorates, dropping below 1.25x for the biggest arrays.
I suppose this, like so, so many performance issues, is related to the cache. My Sandy Bridge i7 laptop CPU has the following caches:
| Level | Size | Contents | Scope |
|-------|------|----------|-------|
| L1 | 32 kB | instruction | per core |
| L1 | 32 kB | data | per core |
| L2 | 256 kB | instruction + data | per core |
| L3 | 6 MB | instruction + data | shared |
At 1000 elements, each array occupies 4 kB. Both arrays, 8 kB total, fit comfortably within the L1 data cache. Even when we loop through multiple trials, our whole data set stays cached. But at 10,000 elements, the 80 kB total spills into the L2 cache and performance begins to slip. As the size increases, our caches become virtually useless. Our memory access patterns become anti-optimal: we load stuff into the cache, use it once, and don't come back to it until the next trial, long after other operations have pushed it out. By that point we're stalling for hundreds of cycles waiting on main memory, and the difference in floating-point execution has a minimal effect.
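The arithmetic behind that explanation can be sketched directly. A minimal sketch, assuming the Sandy Bridge cache sizes from the table above (the function name is mine, not a library call):

```cpp
#include <cstring>

// Two float arrays of n elements occupy 2 * n * sizeof(float) bytes.
// Report the deepest cache level that can hold the whole working set.
// Cache sizes match my Sandy Bridge i7; adjust for your CPU.
const char *deepest_cache_that_fits(long n)
{
    long bytes = 2 * n * (long)sizeof(float);
    if (bytes <= 32 * 1024)       return "L1";  // 32 kB data cache
    if (bytes <= 256 * 1024)      return "L2";  // 256 kB
    if (bytes <= 6 * 1024 * 1024) return "L3";  // 6 MB
    return "RAM";
}
```

At 1000 elements this reports L1; at 10,000, L2; at 10,000,000 the working set blows past even L3.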
There’s not a whole lot we can do if our only task is adding together huge arrays once. But all is not lost! If we have a longer sequence of operations – for example, a sequence of vector additions, subtractions, normalizations, and dot products in a graphics shader – we might still gain a lot by working on cache-sized chunks. In the next post, I’ll run a benchmark on SIMD instructions for that type of workload. I have no idea what the results will be…
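As a preview of that idea, here's a minimal sketch of cache-blocked processing. The two-pass workload (add, then scale) and the chunk size are illustrative assumptions of mine, not taken from the benchmark:

```cpp
#include <xmmintrin.h>

const long CHUNK = 1024; // 4 kB of floats per array; both arrays fit in L1

// Hypothetical two-pass workload: a += b, then a *= s. Running both passes
// chunk-by-chunk means the second pass finds its data still in L1, instead
// of streaming the whole array through the cache twice. Assumes size is a
// multiple of 4 and 16-byte-aligned pointers, as in the benchmark.
void add_then_scale(float *a, const float *b, float s, long size)
{
    __m128 sx = _mm_set1_ps(s);
    for (long i = 0; i < size; i += CHUNK)
    {
        long n = (size - i < CHUNK) ? size - i : CHUNK;
        // pass 1: a += b, on this chunk only
        for (long j = i; j < i + n; j += 4)
            _mm_store_ps(a + j, _mm_add_ps(_mm_load_ps(a + j),
                                           _mm_load_ps(b + j)));
        // pass 2: a *= s, while the chunk is still hot
        for (long j = i; j < i + n; j += 4)
            _mm_store_ps(a + j, _mm_mul_ps(_mm_load_ps(a + j), sx));
    }
}
```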
Here is the full source code of my test if you'd like to compare results on your machine. Sorry for the Windows-specific stuff.
#include <iostream>
#include <sstream>
#include <xmmintrin.h>
#include <Windows.h>
int main(int argc, char **argv)
{
if (argc < 2)
{
std::cout << "add a size argument" << std::endl;
return 1;
}
std::stringstream ss;
ss << argv[1];
long size;
ss >> size;
long const trials = 1000000000 / size;
float *a1 = (float *)_aligned_malloc(
size * sizeof(float), sizeof(__m128));
float *a2 = (float *)_aligned_malloc(
size * sizeof(float), sizeof(__m128));
LARGE_INTEGER reg0, reg1, regtot;
LARGE_INTEGER sse0, sse1, ssetot;
QueryPerformanceCounter(&reg0);
for (long trial = 0; trial < trials; ++trial)
{
float * const end = a1 + size;
float *a = a1;
float *b = a2;
for (; a != end; )
{
*a++ += *b++;
*a++ += *b++;
*a++ += *b++;
*a++ += *b++;
}
}
QueryPerformanceCounter(&reg1);
regtot.QuadPart = reg1.QuadPart - reg0.QuadPart;
QueryPerformanceCounter(&sse0);
for (long trial = 0; trial < trials; ++trial)
{
float * const end = a1 + size;
float *a = a1;
float *b = a2;
for (; a != end; a += 4, b += 4)
{
__m128 ax = _mm_load_ps(a);
__m128 bx = _mm_load_ps(b);
ax = _mm_add_ps(ax, bx);
_mm_store_ps(a, ax);
}
}
QueryPerformanceCounter(&sse1);
ssetot.QuadPart = sse1.QuadPart - sse0.QuadPart;
std::cout << "Regular: " << regtot.QuadPart << std::endl;
std::cout << "SSE: " << ssetot.QuadPart << std::endl;
std::cout << "Ratio: " <<
float(regtot.QuadPart) / ssetot.QuadPart << std::endl;
_aligned_free(a1);
_aligned_free(a2);
}
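If you'd rather avoid the Windows-specific parts, here's a portable sketch of the timing side, assuming C++11; `time_loop_us` is a hypothetical wrapper of mine standing in for either benchmark loop, and `_aligned_malloc` would similarly give way to `posix_memalign` or C++17's `std::aligned_alloc`:

```cpp
#include <chrono>

// A portable alternative to QueryPerformanceCounter.
// steady_clock is monotonic, so it's the right choice for intervals.
// `loop` stands in for either benchmark body above.
template <typename F>
long long time_loop_us(F loop, long trials)
{
    auto t0 = std::chrono::steady_clock::now();
    for (long trial = 0; trial < trials; ++trial)
        loop();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0)
        .count();
}
```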
