Tuesday 27 May 2014

The performance of unaligned SIMD loads and stores on modern hardware

A few weeks ago a colleague and I were discussing the performance characteristics of compiler auto-vectorized code.  Inevitably, concerns over memory alignment came up, and with them, suspicions that the compiler could not possibly generate optimal code because it would need to assume memory was not aligned.  I pointed out that on modern processors, unaligned loads and stores are supposed to perform about as well as their historically faster aligned counterparts. My colleague was less convinced.

So I tested it.

Hypothesis

On modern processors there is little to no penalty for using unaligned loads and stores to SIMD registers.

TL;DR

The details of the tests are below, but if you're the type that likes to skip to the good parts, I will tell you this: generally, you don't need to worry about the alignment of your data when using SIMD on modern processors such as Haswell.

If you're the type that likes a good story, keep reading.

Setup

The test consists of several different implementations of what is essentially a memcpy: a piece of memory is loaded into a SIMD register and then stored back to a different address using one of the following load/store pairs:

_mm_load_ps/_mm_store_ps
_mm_loadu_ps/_mm_storeu_ps
_mm256_load_ps/_mm256_store_ps
_mm256_loadu_ps/_mm256_storeu_ps

The _mm_load/_mm_store versions come from SSE and operate on 4 floats at a time, requiring 16-byte alignment.  The _mm256_load/_mm256_store versions come from AVX and operate on 8 floats at a time, requiring 32-byte alignment.  The versions denoted by 'u' have no alignment requirements.

As a baseline, I also timed a naive for-loop, std::memcpy, and std::copy over the same data.

For fun, I added SSE and AVX implementations that use the so-called 'non-temporal' instructions, which hint to the processor that the data just stored will not be needed for a while, thereby improving cache performance.
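A minimal sketch of the SSE non-temporal version, assuming the same function shape as the other kernels (the name copy_sse_stream is illustrative; the AVX version uses _mm256_stream_ps the same way):

#include <cstddef>      // size_t
#include <immintrin.h>  // _mm_load_ps, _mm_stream_ps, _mm_sfence

void copy_sse_stream(const float* src, float* dst, size_t count)
{
    // count is assumed to be a multiple of 4; _mm_stream_ps requires
    // a 16-byte-aligned destination.
    for (size_t i = 0; i < count; i += 4)
    {
        __m128 v = _mm_load_ps(src + i);
        _mm_stream_ps(dst + i, v);  // store bypasses the cache hierarchy
    }
    _mm_sfence();  // ensure the streaming stores are globally visible
}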

Each test is implemented as a function which is called from a test harness to time it and output results.  For example, the naive for-loop copy looks like this (shown here as a minimal sketch; names are illustrative):
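#include <cstddef>

void copy_naive(const float* src, float* dst, size_t count)
{
    // Straightforward element-by-element copy; the compiler is free
    // to auto-vectorize this loop.
    for (size_t i = 0; i < count; ++i)
        dst[i] = src[i];
}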

The aligned SSE implementation is as follows (again, a minimal sketch):
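#include <cstddef>
#include <immintrin.h>

void copy_sse_aligned(const float* src, float* dst, size_t count)
{
    // 4 floats per iteration; src and dst must be 16-byte aligned,
    // and count is assumed to be a multiple of 4.
    for (size_t i = 0; i < count; i += 4)
    {
        __m128 v = _mm_load_ps(src + i);
        _mm_store_ps(dst + i, v);
    }
}

The unaligned and AVX variants have exactly the same shape; they just swap in _mm_loadu_ps/_mm_storeu_ps, or the _mm256 intrinsics with a stride of 8 floats.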

By feeding each function pointers with alignments ranging from 4 to 64 bytes, we can test the performance characteristics of each.  However, it must be noted that the aligned instructions require aligned memory or the program will crash, so those functions are only called with appropriately aligned data.
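To give a flavor of that, here is a minimal sketch of timing one kernel at a given byte offset from a 64-byte-aligned base buffer (the names and output format here are illustrative, not the actual harness code):

#include <chrono>
#include <cstddef>
#include <iostream>

using copy_fn = void (*)(const float*, float*, size_t);

void time_copy(copy_fn fn, const float* base_src, float* base_dst,
               size_t count, size_t offset_bytes, size_t repeats)
{
    // Derive (possibly misaligned) pointers from the 64-byte-aligned bases.
    const float* src = reinterpret_cast<const float*>(
        reinterpret_cast<const char*>(base_src) + offset_bytes);
    float* dst = reinterpret_cast<float*>(
        reinterpret_cast<char*>(base_dst) + offset_bytes);

    auto start = std::chrono::high_resolution_clock::now();
    for (size_t r = 0; r < repeats; ++r)
        fn(src, dst, count);
    std::chrono::duration<double> elapsed =
        std::chrono::high_resolution_clock::now() - start;

    std::cout << offset_bytes << "-byte offset: " << elapsed.count() << " s\n";
}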

The full source is available on GitHub.

Experiment and Results

I compiled the test program using Visual Studio 2013 and ran everything on my laptop, which is fitted with a Haswell i7-4700.  The command line I used for compilation was cl.exe /EHsc /Ox /arch:AVX2 simd-copy.cpp. I also experimented with running a version compiled with GCC 4.8.2 and got similar results.

The first test I ran moved 1 billion floats in 256 KB chunks.  I.e., I allocated 64K floats (256 KB) and copied them from one location to another.  This was repeated 16K times for a total of about 1 billion float copies, or 4 GB.  This produced the following result:

This was a surprising result to me, so let's go over a few points:
  • The calls to std::copy and std::memcpy are identical. This is to be expected with an optimized version of std::copy, which inlines to a call to memcpy
  • std::copy and std::memcpy are the fastest
  • The naive for-loop copy is the slowest by a large margin
  • The unaligned SSE/AVX calls are about the same and are about 30% slower than std::copy
  • The aligned SSE calls are the same as the unaligned calls
  • The aligned AVX calls fall roughly in between the SSE calls and std::copy
  • The non-temporal instructions have essentially no effect
  • std::copy shows a small speedup on 32-byte boundaries
  • Alignment made very little difference in SSE/AVX performance

I ran the test again on a lot more data in an attempt to create a bigger spread between the implementations.  That test looks like this:
  • Overall results are about the same as the previous test
  • Non-temporal performance is now better
  • The difference between the naive copy and the others is now smaller

The results here are essentially the same as in the previous test, with alignment making little or no difference, even though we're now processing 16 times more data. However, the non-temporal instructions are now performing really well. Why? The answer is the cache. The non-temporal instructions are doing exactly what they are supposed to do; they tell the processor that the data we just wrote doesn't need to be kept in cache, which allows more of the cache to be populated with data prefetched from the source buffer.

The naive copy is doing better for a related but slightly different reason.  In this case, the program is simply becoming memory bound while populating the cache from the source buffer.  When this happens, it doesn't matter how fast the load from cache to register is, because most of the time is spent waiting for data. The results didn't show up like this in the first test because my processor has 256K of L1 cache and all the way up to 6 MB of L3 cache; the entire working set of that test fit comfortably in cache.

So then, what happens if we throw a lot of data at it, like 512 megs?  Well, this:

And what about a small set?
By now a pattern has emerged -- it's clear that the test methods are moving around relative to each other with changing data set size, but at no time does alignment play a role in the performance of loading and storing SSE/AVX data.

Conclusion 

Don't worry about data alignment when writing SIMD code on modern hardware. The tests above adequately demonstrate that, at least on Haswell, alignment isn't going to help you. This is very good news because it frees us up from worrying about these things and allows us to write more features.

In addition, let this be a reminder that if your intuition is telling you where you might be able to find some more speed (say, by aligning all of your SIMD allocations), stop and measure first.  If I had not, the alignment might actually have slowed my program down by making the memory allocator work harder to find a suitable slot.

If anyone else out there has some different processors this can be tested on, I encourage you to do so and post the results here so we can learn how different processors compare.

5 comments:

  1. You might find this stackoverflow question interesting. I think your results largely agree.

    http://stackoverflow.com/questions/1715224/very-fast-memcpy-for-image-processing/18251329

  2. In all the cases you are getting higher latencies for the for-loop, but on my system the for-loop outperforms SSE/AVX some of the time. I am using an Intel i7-4770 CPU @ 3.40GHz (256K L1, 1M L2, 8M L3).
    1. Any idea?
    2. Why is AVX slower than memcpy/std::copy for larger chunks?
    The data below is for the default case (compiler & OS: gcc 4.8.1, Ubuntu 13.04):
    [4,0.155559,0.141459,0.161394,0.183352,0.147407,1.6e-05,1.6e-05,1.6e-05,1.6e-05]
    [16,0.141163,0.141232,0.147006,0.150401,0.148916,0.140791,0.210767,1.6e-05,1.6e-05]
    [32,0.141557,0.141623,0.144863,0.150158,0.144671,0.141189,0.210339,0.144107,0.192292]
    [64,0.141123,0.141167,0.143819,0.150095,0.144601,0.140936,0.210099,0.14391,0.192065]

    Replies
    1. 1. I would need to look at the disassembly to know for sure, but it's possible that the for-loop is actually generating vectorized code on your compiler.
      2. Well, I mean, we're doing a memcpy with instructions that are not meant for that, so I would not expect it to be faster than memcpy, which on my machine generates a single instruction: rep movs. The point was to compare aligned vs. unaligned loads; I only included the memcpy for context.

    2. Thanks Chris,
      I was more interested in memcpy performance; anyway, it was a very nice and helpful post. I will look at the assembly code generated by my compiler for details.
