I’m experimenting with SSE SIMD instructions to speed up graphics calculations. I wrote some test code to measure the speedup. The code allocates two float arrays and adds the second array to the first. The trial is repeated many times with the same blocks of memory, so allocation isn’t a factor. To keep things fair, the non-SIMD loop still operates in blocks of 4 so its loop overhead is equal.
The vanilla floating-point loop looks like this:
float *a1, *a2; //allocated elsewhere
float * const end = a1 + size;
float *a = a1;
float *b = a2;
for (; a != end; )
{
*a++ += *b++;
*a++ += *b++;
*a++ += *b++;
*a++ += *b++;
}
The SSE loop looks like this:
// same setup as before
for (; a != end; a += 4, b += 4)
{
__m128 ax = _mm_load_ps(a);
__m128 bx = _mm_load_ps(b);
ax = _mm_add_ps(ax, bx);
_mm_store_ps(a, ax);
}
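One caveat worth noting: `_mm_load_ps` and `_mm_store_ps` require 16-byte-aligned pointers (which is why the full program below uses `_aligned_malloc`), and will fault on unaligned addresses. For data you can't align there are `_mm_loadu_ps`/`_mm_storeu_ps` variants. A minimal sketch, with a hypothetical helper name, still assuming the size is a multiple of 4:

```cpp
#include <xmmintrin.h>

// Hypothetical helper: a += b over `size` floats when the pointers may not
// be 16-byte aligned. The loadu/storeu intrinsics tolerate any address,
// historically at a small speed penalty. Assumes size is a multiple of 4.
void add_unaligned(float *a, const float *b, long size)
{
    for (const float *end = a + size; a != end; a += 4, b += 4)
    {
        __m128 ax = _mm_loadu_ps(a);
        __m128 bx = _mm_loadu_ps(b);
        _mm_storeu_ps(a, _mm_add_ps(ax, bx));
    }
}
```

On aligned data the aligned intrinsics are the better choice, so the benchmark keeps them.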
Simple, right? The full source code, including memory allocation and timing, is at the bottom of the post.
Each SSE instruction processes 4 floats at a time, so the expected speedup is 4x. I ran tests on arrays from 12 to 10,000,000 elements. Here are the results:
Small arrays still benefit, but there's a peak around 1000 elements. After that, the speedup deteriorates, dropping below 1.25x for the biggest arrays.
I suppose this, like so, so many performance issues, is related to the cache. My Sandy Bridge i7 laptop CPU has the following caches:
| Level | Size | Contents | Scope |
|-------|------|----------|-------|
| L1 | 32 kB | instruction | per core |
| L1 | 32 kB | data | per core |
| L2 | 256 kB | instruction + data | per core |
| L3 | 6 MB | instruction + data | shared |
At 1000 elements, each array occupies 4 kB. Both arrays, 8 kB total, fit comfortably within the L1 data cache. Even when we loop through multiple trials, our whole data set stays cached. But at 10,000 elements, the 80 kB total spills into the L2 cache and performance begins to slip. As the size increases, our caches become virtually useless. Our memory access patterns become anti-optimal: we load stuff into the cache, use it once, and don't come back to it until the next trial, long after other operations have pushed it out. By that point we're stalling for hundreds of cycles waiting on main memory, and the difference in floating-point execution has a minimal effect.
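The arithmetic behind that explanation can be sketched directly. A minimal sketch, assuming the Sandy Bridge cache sizes from the table above (the function name is mine, not a library call):

```cpp
#include <cstring>

// Two float arrays of n elements occupy 2 * n * sizeof(float) bytes.
// Report the deepest cache level that can hold the whole working set.
// Cache sizes match my Sandy Bridge i7; adjust for your CPU.
const char *deepest_cache_that_fits(long n)
{
    long bytes = 2 * n * (long)sizeof(float);
    if (bytes <= 32 * 1024)       return "L1";  // 32 kB data cache
    if (bytes <= 256 * 1024)      return "L2";  // 256 kB
    if (bytes <= 6 * 1024 * 1024) return "L3";  // 6 MB
    return "RAM";
}
```

At 1000 elements this reports L1; at 10,000, L2; at 10,000,000 the working set blows past even L3.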
There’s not a whole lot we can do if our only task is adding together huge arrays once. But all is not lost! If we have a longer sequence of operations – for example, a sequence of vector additions, subtractions, normalizations, and dot products in a graphics shader – we might still gain a lot by working on cache-sized chunks. In the next post, I’ll run a benchmark on SIMD instructions for that type of workload. I have no idea what the results will be…
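As a preview of that idea, here's a minimal sketch of cache-blocked processing. The two-pass workload (add, then scale) and the chunk size are illustrative assumptions of mine, not taken from the benchmark:

```cpp
#include <xmmintrin.h>

const long CHUNK = 1024; // 4 kB of floats per array; both arrays fit in L1

// Hypothetical two-pass workload: a += b, then a *= s. Running both passes
// chunk-by-chunk means the second pass finds its data still in L1, instead
// of streaming the whole array through the cache twice. Assumes size is a
// multiple of 4 and 16-byte-aligned pointers, as in the benchmark.
void add_then_scale(float *a, const float *b, float s, long size)
{
    __m128 sx = _mm_set1_ps(s);
    for (long i = 0; i < size; i += CHUNK)
    {
        long n = (size - i < CHUNK) ? size - i : CHUNK;
        // pass 1: a += b, on this chunk only
        for (long j = i; j < i + n; j += 4)
            _mm_store_ps(a + j, _mm_add_ps(_mm_load_ps(a + j),
                                           _mm_load_ps(b + j)));
        // pass 2: a *= s, while the chunk is still hot
        for (long j = i; j < i + n; j += 4)
            _mm_store_ps(a + j, _mm_mul_ps(_mm_load_ps(a + j), sx));
    }
}
```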
Here is the full source code of my test if you'd like to compare results on your machine. Sorry for the Windows-specific stuff.
#include <iostream>
#include <sstream>
#include <xmmintrin.h>
#include <Windows.h>
int main(int argc, char **argv)
{
if (argc < 2)
{
std::cout << "add a size argument" << std::endl;
return 1;
}
std::stringstream ss;
ss << argv[1];
long size;
ss >> size;
long const trials = 1000000000 / size;
float *a1 = (float *)_aligned_malloc(
size * sizeof(float), sizeof(__m128));
float *a2 = (float *)_aligned_malloc(
size * sizeof(float), sizeof(__m128));
LARGE_INTEGER reg0, reg1, regtot;
LARGE_INTEGER sse0, sse1, ssetot;
QueryPerformanceCounter(&reg0);
for (long trial = 0; trial < trials; ++trial)
{
float * const end = a1 + size;
float *a = a1;
float *b = a2;
for (; a != end; )
{
*a++ += *b++;
*a++ += *b++;
*a++ += *b++;
*a++ += *b++;
}
}
QueryPerformanceCounter(&reg1);
regtot.QuadPart = reg1.QuadPart - reg0.QuadPart;
QueryPerformanceCounter(&sse0);
for (long trial = 0; trial < trials; ++trial)
{
float * const end = a1 + size;
float *a = a1;
float *b = a2;
for (; a != end; a += 4, b += 4)
{
__m128 ax = _mm_load_ps(a);
__m128 bx = _mm_load_ps(b);
ax = _mm_add_ps(ax, bx);
_mm_store_ps(a, ax);
}
}
QueryPerformanceCounter(&sse1);
ssetot.QuadPart = sse1.QuadPart - sse0.QuadPart;
std::cout << "Regular: " << regtot.QuadPart << std::endl;
std::cout << "SSE: " << ssetot.QuadPart << std::endl;
std::cout << "Ratio: " <<
float(regtot.QuadPart) / ssetot.QuadPart << std::endl;
_aligned_free(a1);
_aligned_free(a2);
}
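If you'd rather avoid the Windows-specific parts, here's a portable sketch of the timing side, assuming C++11; `time_loop_us` is a hypothetical wrapper of mine standing in for either benchmark loop, and `_aligned_malloc` would similarly give way to `posix_memalign` or C++17's `std::aligned_alloc`:

```cpp
#include <chrono>

// A portable alternative to QueryPerformanceCounter.
// steady_clock is monotonic, so it's the right choice for intervals.
// `loop` stands in for either benchmark body above.
template <typename F>
long long time_loop_us(F loop, long trials)
{
    auto t0 = std::chrono::steady_clock::now();
    for (long trial = 0; trial < trials; ++trial)
        loop();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0)
        .count();
}
```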
