Sunday 1 November 2015

Beware of codegen differences between std::for_each and range-based-for in VS2015

std::for_each != for(auto i : a)


Switching between range-based-for and std::for_each should be transparent and seamless. But it turns out that in VS2015, these two forms of iteration do not always generate equivalent code. Unfortunately, it's not the case that one form is always better than the other, and even more disturbing is that the performance difference in my tests can be as large as 15x. Curious? Read on.

Background


Recently, I've been working on an abstraction that utilizes so-called 'fat iterators'. These are smart iterators that adapt some concept into a range. boost::zip_iterator is a good example of this; it joins two iterators together so that an operation can be run on both of them through a single representation of the pair. Anyone who has ever wanted to sort one range in terms of the values in another knows what this is good for.

Example: Say we have two std::vectors, one with names and one with corresponding ids.  In the current standard there's no way to sort those together. Instead, one needs to copy the data to a struct and sort that. That sucks, so enter boost::zip_iterator;

std::sort(
    boost::make_zip_iterator(boost::make_tuple(names.begin(), ids.begin())),
    boost::make_zip_iterator(boost::make_tuple(names.end(), ids.end()))
);

Perfect. We've sorted alphabetically and the ids came with them. This is great and, hypothetically, is a free abstraction performance-wise.

Update: Nov 7th, 2015 - Since the original publication, it was pointed out to me that this example doesn't actually work. It'll compile on C++03 compilers, but it will give the wrong result because of how the zip_iterator returns values to the user, which is via a tuple object containing references to the underlying values. Despite that, I think you get the idea of what fat iterators are all about.
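
For reference, a minimal sketch of the copy-into-a-struct approach mentioned above looks something like this (the struct and function names here are just illustrative):

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct NamedId
{
    std::string name;
    int         id;
};

void sort_together(std::vector<std::string>& names, std::vector<int>& ids)
{
    // copy both ranges into a single struct per element...
    std::vector<NamedId> combined(names.size());
    for(std::size_t i = 0; i < names.size(); ++i)
        combined[i] = NamedId{ names[i], ids[i] };

    // ...sort that...
    std::sort(combined.begin(), combined.end(),
        [](NamedId const& a, NamedId const& b) { return a.name < b.name; });

    // ...and copy the results back out.
    for(std::size_t i = 0; i < combined.size(); ++i)
    {
        names[i] = combined[i].name;
        ids[i]   = combined[i].id;
    }
}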

The library I am working on uses a similar technique for a different abstraction, but it should also be, hypothetically, zero-overhead, so I have a large suite of performance benchmarks that run as part of my tests.

Problem 1: Range-based-for is sometimes slower than std::for_each


The other day, I added a test to make sure my iterators worked with the C++11 range-based-for. This was a cut and paste of a test that used std::for_each, but adapted to the newer syntax.  What I saw was a 20% reduction in performance.

20%!

I checked the code to make sure I hadn't made a mistake, but found none. To be sure, I compiled the tests with gcc 5.2 and found no performance difference between std::for_each and range-based-for, so this was definitely something specific to VS2015.

I dumped the assembly of the loops in question to see what was going on. The std::for_each emitted this code:

movss       xmm0,dword ptr [rcx]  
addss       xmm0,xmm6  
movss       dword ptr [rcx],xmm0  
add         rcx,4  
cmp         rcx,rax  
jne         

movss       xmm0,dword ptr [rdx+rcx]  
mulss       xmm0,xmm1  
addss       xmm0,dword ptr [rcx]  
movss       dword ptr [rcx],xmm0  
add         rcx,4  
sub         r9,1  
jne         

movss       xmm0,dword ptr [rdx+rcx]  
mulss       xmm0,xmm1  
addss       xmm0,dword ptr [rcx]  
movss       dword ptr [rcx],xmm0  
add         rcx,4  
sub         r9,1  
jne         

Each block here represents the looping section of a std::for_each call that is performing some floating-point adds and multiplies. I've removed the branch target addresses for clarity, so I'll point out that the 'jne' at the end of each block sends us back to the top of that block until the loop is complete. Nothing too special here; this code looks reasonable.

Here's the corresponding range-based-for assembly:

mov         rax,qword ptr [rbp+200h]  
movss       xmm0,dword ptr [rax+rcx*4]  
addss       xmm0,xmm2  
movss       dword ptr [rax+rcx*4],xmm0  
inc         rcx  
cmp         rcx,rdx  
jne         

mov         rdx,qword ptr [rbp+250h]  
mov         rcx,qword ptr [rbp+200h]  
movss       xmm0,dword ptr [rcx+rax*4]  
mulss       xmm0,xmm1  
addss       xmm0,dword ptr [rdx+rax*4]  
movss       dword ptr [rdx+rax*4],xmm0  
inc         rax  
cmp         rax,r8  
jne         

mov         rdx,qword ptr [rbp+2A0h]  
mov         rcx,qword ptr [rbp+250h]  
movss       xmm0,dword ptr [rcx+rax*4]  
mulss       xmm0,xmm1  
addss       xmm0,dword ptr [rdx+rax*4]  
movss       dword ptr [rdx+rax*4],xmm0  
inc         rax  
cmp         rax,r8  
jne     

Again, we have a representation of three loops, but obviously there is more code. Upon closer inspection though, we see that the code is mostly the same. The differences are the extra mov instructions at the top of each block:

mov         rax,qword ptr [rbp+200h]  

mov         rdx,qword ptr [rbp+250h]  
mov         rcx,qword ptr [rbp+200h]  

mov         rdx,qword ptr [rbp+2A0h]  
mov         rcx,qword ptr [rbp+250h]  

Five additional mov instructions; the rest is the same. The interesting thing is that those mov instructions don't do anything useful after the first iteration, because $rbp doesn't change at all during the loop. Take a look, I'll wait.

To me this implies a bug in the compiler where the jump target chosen for jne is wrong: though the moves are initially required to set up $rdx and $rcx, everything after that is static. What should actually happen is that jne should jump to the first movss instruction within each block rather than re-running those initial moves on every iteration. To confirm this, I patched the executable to do just that, and indeed I got the correct result and the performance problem was gone.

I've been trying for a few days now to come up with a distilled test that reveals the problem, but have been unable to do so.  However, during that process, I stumbled upon...

Problem 2: std::for_each is sometimes slower than range-based-for


While trying to isolate the original problem, I wrote the following program:

// compile with: cl.exe /EHsc /O2
#include <iostream>
#include <vector>
#include <algorithm>
#include <sstream>

#if RANGE_FOR
  char const* type = "Range: ";
#elif FOREACH_LAMBDA
  char const* type = "ForEach L: ";
#else
#  error "Must Define one of RANGE_FOR or FOREACH_LAMBDA"
#endif

typedef float value_type;

int main(int argc, char** argv)
{
    int num_values = 5 * 1000 * 1000;
    if(argc > 1)
    {
        std::stringstream param;
        param.str(argv[1]);
        param >> num_values;
        if(param.fail())  // fail() catches a failed extraction; bad() would not
        {
            std::cout << "Failed to parse " 
                      << argv[1] << " as int.";
            return 1;
        }
    }
    int num_iterations = 2000;
    if(argc > 2)
    {
        std::stringstream param;
        param.str(argv[2]);
        param >> num_iterations;
        if(param.fail())
        {
            std::cout << "Failed to parse " 
                      << argv[2] << " as int.";
            return 1;
        }
    }
    
    std::vector<value_type> values(num_values, 1);

    for(int i = 0; i < num_iterations; ++i)
    {
    #if RANGE_FOR
        for(value_type& v : values)
        {
            v += 1;
        };
    #elif FOREACH_LAMBDA
        std::for_each(values.begin(), values.end(), 
            [](value_type& v)
            {
                v += 1;
            }
        );
    #endif
    }

    std::cout << type << values.front() << "\n";
    
    return 0;
}

This is about as distilled as I could make it, and to my delight, each case generated identical code. So, to try to make them generate different code, I added another variable.

    typedef float total_type;
    total_type total = 0;
    for(int i = 0; i < num_iterations; ++i)
    {
    #if RANGE_FOR
        for(value_type& v : values)
        {
            v += 1;
            total += v;
        };
    #else
        std::for_each(values.begin(), values.end(), 
            [&total](value_type& v)
            {
                v += 1;
                total += v;
            }
        );
    #endif
    }

This time they did generate different code, but it was std::for_each that was slower: roughly 3x slower (9 seconds for range-based-for vs 28 seconds for std::for_each).

Here's the generated assembly from the std::for_each:

movss       xmm1,dword ptr [rax]  
addss       xmm1,xmm6  
movss       dword ptr [rax],xmm1  
movss       xmm0,dword ptr [total]  
addss       xmm0,xmm1  
movss       dword ptr [total],xmm0  
add         rax,4  
inc         rcx  
cmp         rcx,rdx  
jne         main+510h  

The two movss instructions that touch [total] are the additional instructions that are absent from the range-based-for implementation, and as in the first example, they're both completely unnecessary. If you can't tell, what that group of three lines (the two that touch [total] plus the addss in between) is doing is this line from the source code:

total += v;

And a description of the generated assembly:

movss       xmm0,dword ptr [total]  # Load 'total' from memory
addss       xmm0,xmm1               # Add 'v' to 'total'
movss       dword ptr [total],xmm0  # Store 'total' back to memory

The load and the store here are completely unnecessary because 'total' could simply be kept in a register for the whole loop, as the range-based-for version does, so we'd always have the latest value. Bizarre.

At first, I thought that this was somehow a floating-point problem and the compiler was getting confused about whether it could keep the value in a register. So, I switched to ints.

Now the difference was an astounding 15x slower for std::for_each (1.4 seconds for range-based-for vs 23 seconds for std::for_each).  Huh?  How is that even possible?

Well, let's look at some assembly again. Here's the std::for_each + lambda:

inc         dword ptr [rax]  
mov         ecx,dword ptr [rax]  
add         dword ptr [total],ecx  
lea         rax,[rax+4]  
inc         rdx  
cmp         rdx,r8  
jne         main+500h

That's not too bad, other than the extra work of updating the total in memory on each iteration.

How does the range-based-for achieve a 15x speedup on this?

movdqu      xmm0,xmmword ptr [rax]  
paddd       xmm0,xmm2  
movdqu      xmmword ptr [rax],xmm0  
paddd       xmm0,xmm1  
movdqa      xmm1,xmm0  
movdqu      xmm0,xmmword ptr [rax+10h]  
paddd       xmm0,xmm2  
movdqu      xmmword ptr [rax+10h],xmm0  
paddd       xmm0,xmm3  
movdqa      xmm3,xmm0  
add         rax,20h  
add         rcx,8  
cmp         rcx,r9  
jne         main+520h  
cmp         rcx,r8  
je          main+570h

Ah, in this case the compiler realized that it could vectorize the loop and add up four ints at a time. Cool, but why didn't it do this for floats, you might ask? The answer is that floating-point addition is not associative. The way the vectorization works is by adding up the numbers in separate SIMD lanes and then adding all the lanes together at the end. Because this changes the order of the operations, and for floats such a change could slightly change the result, the compiler cannot do it.

For fun, I compiled the integer program with AVX2 support (/arch:AVX2), which allows 8 integer additions per instruction, and the run time dropped to 0.6 seconds, roughly a 40x improvement over the std::for_each version.

Ouch.

Anyway, back to the actual problem. Because this test case was much smaller than the one utilizing my fat iterators, I was able to find the culprit: it's the capture of 'total' into the lambda. If I convert the code from the lambda syntax to a hand-written functor struct (i.e. the C++03 way, sketched below), then I get optimal (though not auto-vectorized) code again.

Huh, bummer. I really like lambdas.
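
For reference, the functor equivalent of the capturing lambda looks something like this; a minimal sketch using the value_type/total_type typedefs from the test program above (the struct name is mine):

// C++03-style functor that 'captures' total by holding a reference to it
struct AddAndAccumulate
{
    explicit AddAndAccumulate(total_type& total) : total_(total) {}

    void operator()(value_type& v) const
    {
        v += 1;
        total_ += v;
    }

    total_type& total_;
};

// used in place of the lambda:
// std::for_each(values.begin(), values.end(), AddAndAccumulate(total));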

To be thorough, I checked gcc 5.2 (mingw-w64) to see what it generated, and in all of my test cases, including the convoluted fat-iterator tests, I got identical code for both range-based-for and std::for_each. Here's the gcc assembly for the floating-point tests.

movss  (%rax),%xmm0
add    $0x4,%rax
addss  %xmm1,%xmm0
movss  %xmm0,-0x4(%rax)
cmp    %rax,%rdx
addss  %xmm0,%xmm6
jne    

Aside from the gdb disassembler syntax and some instruction reordering, this version is the same as the optimal range-based-for code from VC++ 2015. Excellent!

Before we praise gcc too much, though, it should be noted that at no time did gcc auto-vectorize any of the code, so there's a point for VC++ on that front.

One other test I did was calling the lambda from the range-based-for, just to make sure the issue wasn't somehow std::for_each related, and sure enough, once again the captured variable was reloaded and stored on each iteration. A sketch of that test is below.
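
Roughly, that test looks like this (a sketch of the idea, not the exact benchmark code):

// same lambda as before, but invoked from a range-based-for instead of std::for_each
auto fn = [&total](value_type& v)
{
    v += 1;
    total += v;
};

for(value_type& v : values)
{
    fn(v);
}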

This indicates that the problem is not really related to the different looping constructs and is actually a problem with how VC++ handles lambda captures. However, because std::for_each encourages the use of lambdas, it still seems pretty relevant.

Wrapping Up


Sadly, I couldn't come up with a test case where VC++ generated bad code for the range-based-for, other than my convoluted fat iterators. I even experimented with some of the boost fat-iterators and was unable to reproduce the problem. I'm hoping someone can shed some light on that so we can come up with a workaround. I plan to check the results on VS2015 Update 1, which should be out shortly, to see if the problem is fixed. If not, I'll be filing a bug report with Microsoft pointing them to my test suite, which will hopefully be releasable at that point. I also plan to test the lambda version just discussed. If that also fails, then there will be two bug reports.

Update (Dec 7th, 2015): I tested everything on VS2015 Update 1 and found that all of the issues still occur. A bug has been logged with Microsoft so I expect it will be fixed.

Until then, consider this a PSA to always measure your code. If you have some performance sensitive code, then don't just blindly switch between range-based-for and std::for_each; make sure you measure and look at the generated code.

The full source code for the test harness is available below. Compile with cl /EHsc /O2. Add /DRANGE_FOR=1 to test range-based-for, /DFOREACH_LAMBDA=1 to use the lambda syntax, or /DFOREACH_FUNCTOR=1 to test the functor equivalent of the lambda. The test results you get may differ from mine; for reference, this was run on my laptop, which is equipped with a quad-core Haswell i7-4700.

Update (Nov 2nd, 2015): I tested everything above on VS2012 and found that none of the issues occur there, so this is looking very much like a problem isolated to VS2015.


Saturday 4 April 2015

Gaining Performance Through High Level Abstraction - Part 3

In part one and part two of this series, I talked about how it's possible to increase performance by using high-level abstractions, because those abstractions usually allow the programmer to implement a feature faster, which in turn affords them more time to either optimize the areas that really matter or iterate on the existing solution until it's 'fast enough'. But there's still another way that using abstractions can enable easy optimization wins.


A quick refresher


In the previous posts, we were able to achieve a best tokenization time of 4.5 seconds for 10 million comma-separated values. This was using the split algorithm from boost, plus some manual whitespace trimming. The next fastest solution used Boost.Xpressive to apply a regex tokenizer to the string. The code for both versions is available here and here, which shows that the Xpressive solution is considerably shorter than the split version, though it was 25% slower. As we know, less code usually means fewer bugs, so we really would like to keep the shorter version if possible.

But what if we really need it to run in less than 5 seconds? Assuming one currently has a working version of the Xpressive tokenizer, the first attempt should be to optimize that instead of writing something new. But how? There are only 3 lines of code!


Hiding behind your interface


High-level data structures and algorithms present an interface to the user for 'doing something'. If these interfaces are well designed and consistent, then it becomes possible to substitute one data structure or algorithm for another when requirements change. Most of the STL is designed this way, with consistent push_back, insert, begin/end, etc. It's not perfect, but it's certainly very good.

The STL algorithms are still a contentious issue in many real C++ shops. Some people love them, but others still go for raw loops, avoiding <algorithm> altogether. I typically argue that the algorithms are better because they're self-documenting and less likely to cause bugs, but this isn't always a persuasive argument; it's met with resistance because many programmers are not accustomed to the algorithms and are more easily able to recognize the common patterns in raw loop code. I understand this, but I believe a professional programmer needs to be able to both read and write both styles.

However, the new parallel STL proposal is an example of where using these high-level algorithms can gain you performance. If you need to parallelize some code quickly and that code was written with std:: algorithms, then it can be trivially parallelized by passing a parallel execution policy to the algorithms. This is not a silver bullet if you want to maximize hardware utilization, but it's an easy way to gain performance with little effort, all because of the interface.
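
As a rough sketch of the idea, using the spelling that eventually landed in C++17 (<execution> and std::execution::par) rather than the exact names from the 2015 proposal:

#include <algorithm>
#include <execution>
#include <vector>

void add_one(std::vector<float>& values)
{
    // the only change from the serial version is the execution policy argument
    std::for_each(std::execution::par, values.begin(), values.end(),
        [](float& v) { v += 1.0f; });
}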

So what does this have to do with our parser? We're not going to make it faster via parallelism, but we are going to take advantage of a common interface.

One thing that each parser implementation is doing is copying each substring into the std::set. It would be nice if we could remove that.

Enter boost::string_ref

boost::string_ref is the basis for the proposed std::string_view, which implements a non-owning reference to a string. What's nice about this is that a string_ref can be used to represent a slice of the owning string. To use it, we change the interfaces of the parser functions to use string_ref instead of std::string. Here's a modified xpressive_csv;


We've changed the function signature for both the input and the resulting std::set. This may not be possible in all situations, because we are now forcing the caller to keep the original string alive and we've changed the type we're putting the results into. But because string_ref has a very similar interface to std::string, this will be possible in many situations with little or no change for the user. We've also had to modify the regex types to use string_ref's iterator type, because smatch et al. are really just typedefs of the base templates for std::string iterators. There are also typedefs for char* (cmatch et al.), which we could have used, but it seemed better to use the string_ref iterators.
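
The shape of that change is roughly the following sketch (the function name, typedefs, and regex here are illustrative, not the exact code):

#include <boost/utility/string_ref.hpp>
#include <boost/xpressive/xpressive.hpp>
#include <set>

namespace xpr = boost::xpressive;

// regex machinery instantiated for string_ref's iterator (which is char const*)
typedef boost::string_ref::const_iterator  ref_iterator;
typedef xpr::basic_regex<ref_iterator>     ref_regex;
typedef xpr::regex_iterator<ref_iterator>  ref_regex_iterator;
typedef xpr::match_results<ref_iterator>   ref_match;

void xpressive_csv(boost::string_ref input, std::set<boost::string_ref>& results)
{
    // one or more characters that are neither a comma nor whitespace
    ref_regex token = +~(xpr::set= ',', ' ', '\t', '\n', '\r');

    for(ref_regex_iterator cur(input.begin(), input.end(), token), end; cur != end; ++cur)
    {
        ref_match const& what = *cur;
        // build a string_ref slice over the caller's buffer; the characters are not copied
        results.insert(boost::string_ref(what[0].first, what[0].length()));
    }
}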

Running this code is about 30% faster than before and about 10% faster than the original split method. Split is now also about 30% faster than before and runs in a mere 3 seconds. Here's that code;


And the corresponding timings for both;


What about the set?


The next logical step from here is to replace the std::set with something a little better. My first thought was to use something like boost::container::flat_set, which is essentially a sorted array with a std::set interface. Using this container is a straightforward transformation, but doing so results in extremely slow code.

This actually takes so long to run that I had to cancel it. The problem is that flat_set insertion is very slow because, the container being a sorted array, each out-of-order insert is O(N), making the whole parse O(N^2). The solution is to accumulate the results into a separate list, sort that, and then insert into the flat_set in order. Both boost flat_set and flat_map expose an insert overload for doing this precise thing: inserting a range of sorted, unique items. A sketch of this is below.
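
A sketch of that approach (names assumed, and using string_ref to match the code above):

#include <algorithm>
#include <vector>
#include <boost/container/flat_set.hpp>
#include <boost/utility/string_ref.hpp>

void fill_flat_set(std::vector<boost::string_ref>& tokens,
                   boost::container::flat_set<boost::string_ref>& results)
{
    // sort and de-duplicate the accumulated tokens first...
    std::sort(tokens.begin(), tokens.end());
    tokens.erase(std::unique(tokens.begin(), tokens.end()), tokens.end());

    // ...then tell flat_set that the range is already sorted and unique so it
    // can bulk-insert instead of searching for a position per element
    results.insert(boost::container::ordered_unique_range,
                   tokens.begin(), tokens.end());
}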

However, running this code shows that it's actually slower. This is where constant factors come in. Both solutions are O(N log N), but the std::set insertion grows from an empty container, so its cost is more like log(1) + log(2) + ... + log(N), which only approaches N log N in the limit, whereas std::sort is working on the full N elements right from the get-go, and that's a real difference. There are still potential benefits to this technique, though. One was mentioned earlier in regard to the parallel algorithms: we could invoke a parallel sort on the intermediate list and speed that up significantly. Another is related to how the data is consumed post-parse: if the set is essentially immutable after the parse, it might be better to sacrifice parse speed for better performance when consuming the data, something flat_set can make a very real difference with because it utilizes the cache much more effectively. But that discussion is for another time.

Saturday 28 March 2015

Gaining Performance Through High Level Abstraction - Part 2

Previously, I discussed how to gain performance by moving to higher level abstractions instead of going lower level as is commonly perceived.

Xpressing the non-obvious


One thing that bothered me about the original post was that, while the optimized version using the tokenizer was probably fast enough, it took quite a bit more code to get a 35% increase in performance.

Since then, I've found another high-level method that, while not as fast as the split method, offers a reasonable speedup with little or no additional code. That method is Boost.Xpressive.

Xpressive is an interesting library in that it implements a domain-specific language within C++, which allows for the generation of highly optimized regular expressions.

As an experiment, I implemented a version of the original parser using Xpressive.  That code looks like this;
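
In rough form, it looks something like this (a sketch; the exact regex and names may differ):

#include <boost/xpressive/xpressive.hpp>
#include <set>
#include <string>

namespace xpr = boost::xpressive;

void xpressive_csv(std::string const& input, std::set<std::string>& results)
{
    // one or more characters that are neither a comma nor whitespace
    xpr::sregex token = +~(xpr::set= ',', ' ', '\t', '\n', '\r');

    for(xpr::sregex_iterator cur(input.begin(), input.end(), token), end; cur != end; ++cur)
    {
        results.insert((*cur)[0].str());
    }
}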

What's nice about this is that it's short, clear, and reasonably fast. Here's the timing;


This runs about 18% faster than the optimized tokenizer version with approximately the same amount of code. The 'split' version, shown below as a refresher, is still about 25% faster than this one, but this is considerably less code, which might be worth it.

As a further experiment, I implemented an identical solution using std::regex;


This is virtually identical to the Xpressive version, but slower; performance is closer to the original tokenizer version, which is good news for Xpressive and means I'll probably turn to that library if I need to do any regular expression work in the future.


Are we there yet?


I've updated the token test file with the added parsers, available here. At this point, you're probably thinking that this isn't optimizable any further without going really raw (like raw-pointer raw!). But this isn't the case; next up is another technique in which high-level libraries free us up to change things in order to allow for higher performance.

Sunday 22 March 2015

Gaining Performance Through High Level Abstraction - Part 1

In my career, C++ has been used for about 90% of everything. Personally, at work, C++ is probably somewhere around 99% of the code I write. There are various reasons for choosing C++ for the projects I work on, but the two main ones are performance and cross-platform compatibility.

Applications like these, however, invariably have both low-level, performance-critical sections and high-level sections where performance is a non-issue. Here we find a problem with C++: by default, it doesn't come with enough high-level libraries to allow programmers to work on high-level features efficiently, and in the end, that hurts the performance of both the programmers and the end product. We need access to higher-level libraries by default, and hopefully by the end of this, you will agree.

All about the libs man!


One of the great strengths of C++, and possibly its greatest strength, is the ability to write both low-level code and, through libraries, very high-level code. The high-level code can indeed get very high level, as the language is malleable enough to let libraries like Boost.Spirit or the new range proposal effectively alter the language itself. Of course, other languages also have libraries, and I'd argue that the strength of the likes of Python, Java and C# is their libraries, not the languages themselves.

However, on big C++ projects, I often find that the use of anything outside of the C++ standard library is not encouraged, or even actively discouraged, usually due to performance concerns. Even the standard library is banned on some projects because the performance gurus don't want a stitch of software in the way of optimal performance.

The problem with this thinking is that it actually hurts performance because in reality, optimal performance is an illusion. On any software project, there's only so much time to make the product and that time needs to be split between making features and optimizing. This is where the famous 'premature optimization' quote comes in; any time you spend optimizing is time you don't have to build features, so if you optimize something that's not performance critical, you've wasted time.

My own experience has made me a firm believer in this, so I always prioritize building the feature first and optimizing later, because I don't want to spend time optimizing something that may get cut or simply may not be important enough to optimize. But in order to do this well, one needs access to high-level algorithms and libraries, because once one gets the hang of programming with algorithms, hand-rolling each piece of code takes much longer than just using them. Algorithm-centric code also generally contains fewer bugs and is easier to read, because the reader doesn't need to decode each loop and conditional, so the time savings apply project-wide, which eventually equates to even more time at the end of the project to optimize. It's a cumulative advantage.

Highs and Lows


Recently, I hit a bug in a piece of production code that was programmed at a pretty low level. The function was parsing a string and the bug was a crash when a string containing only whitespace was passed to the function. To fix the bug, I had to;

- Determine what the code was trying to do (decode for-loops and conditionals)
- Determine how to fix the current code
- Apply and test the fix with various inputs.

Except, when I figured out what the code was doing, I realized I could replace all of it with an existing library, which would reduce the code to a fraction of its original size while fixing the bug at the same time. The original code looked something like this:

What this is essentially doing is parsing a comma-separated list of strings and stripping whitespace in the process. I replaced the whole thing with this:
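
Something along these lines (a sketch; the function name is assumed):

#include <boost/tokenizer.hpp>
#include <boost/algorithm/string/trim.hpp>
#include <set>
#include <string>

void tokenize_csv(std::string const& input, std::set<std::string>& results)
{
    // escaped_list_separator (the default) handles commas, quoting and escapes
    typedef boost::tokenizer<boost::escaped_list_separator<char> > tokenizer;

    tokenizer tokens(input);
    for(std::string token : tokens)
    {
        boost::algorithm::trim(token);   // strip the surrounding whitespace ourselves
        results.insert(token);
    }
}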

Here, we're using a high-level abstraction from boost that does all of the tokenization logic for us. Less code, no bugs, much better.

This is where the performance gurus come in and ask about the performance differences. On my machine I get the following numbers parsing a string with 10 million commas;

This shows that the original version was about two times faster, but the bug is fixed and the code is smaller and easier to read, so as long as this particular code is not in a performance-critical section, I would call it done despite it being slower. I would then use my saved time to optimize something that matters. In this context, that feels like a cop-out though, so let's see if we can optimize this. Considering it took no time to write this simple implementation, we can afford a few cycles to optimize it.

First, reading the docs, it seems the tokenizer is doing more work than the original code did, including looking for quotes and handling escape characters. Plus, we have to manually remove the whitespace after tokenizing, which is a bit of a bummer. Luckily, tokenizer lets us customize the separator in a way that makes it behave more like the original code. So, let's change the code as follows to give that a go:
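
A sketch of that change (again, not the exact code):

#include <boost/tokenizer.hpp>
#include <set>
#include <string>

void tokenize_csv(std::string const& input, std::set<std::string>& results)
{
    // treat commas and whitespace as dropped delimiters; no quote/escape handling,
    // and no separate trimming pass is needed afterwards
    typedef boost::tokenizer<boost::char_separator<char> > tokenizer;

    boost::char_separator<char> separators(", \t\n\r");
    tokenizer tokens(input, separators);
    for(std::string const& token : tokens)
    {
        results.insert(token);
    }
}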

Here we've changed from using the escaped_list_separator to the simpler char_separator. Running this shows a significant improvement; in fact, we're already beating the original implementation on performance, and I'm pretty sure any reasonable programmer would still be typing in the original version, let alone debugging or timing it, and this only took a few minutes.

Suppose, however, that this code did need to be optimized further. A side effect of using high-level libraries is that the code is much easier to understand; in this case, what it's doing is plain as day: it's tokenizing a string on ',' and whitespace. This means it's easier for someone to write a more optimal version should it actually be necessary.

Looking at what the tokenizer does, it's easy to see that the intent of this code is to split the string and trim whitespace, so a good first attempt at optimizing is to do exactly that: we'll remove the tokenizer and use a slightly lower-level library function, boost::split. This function takes the string, splits it, and inserts the chunks into a container for you.

Not bad. It's still pretty high level; the only extra code we need is to remove the whitespace manually, because split doesn't know anything about that. However, there's a bug. Do you see it? Take a look, I'll wait.

If you found it, good for you. If not, don't worry, because it's a classic: std::isspace(*end) is off-by-one. It should be std::isspace(*(end-1)). By removing one level of abstraction, we introduced a bug just by trying to optimize. Here's the corrected version.
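
Here's a sketch of that corrected split-and-trim approach (names assumed):

#include <boost/algorithm/string/classification.hpp>
#include <boost/algorithm/string/split.hpp>
#include <cctype>
#include <set>
#include <string>
#include <vector>

void split_csv(std::string const& input, std::set<std::string>& results)
{
    std::vector<std::string> chunks;
    boost::split(chunks, input, boost::is_any_of(","));

    for(std::string const& chunk : chunks)
    {
        std::string::const_iterator begin = chunk.begin();
        std::string::const_iterator end   = chunk.end();
        while(begin != end && std::isspace(static_cast<unsigned char>(*begin)))
            ++begin;
        // note *(end - 1), not *end: dereferencing 'end' itself was the off-by-one
        while(end != begin && std::isspace(static_cast<unsigned char>(*(end - 1))))
            --end;
        results.insert(std::string(begin, end));
    }
}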

That being said, we did fare better: the new implementation is about 40% faster than the original and 35% faster than the optimized tokenizer version.

Flying High


So, at this point, we have been able to write a quick and dirty tokenizer-based parser and go through two iterations of optimization before approaching the complexity of the original version. It's probably possible to come up with an even faster solution should it be necessary, but it might never be. This is how high-level libraries make your products faster: by allowing you to write features quickly and read the code easily, giving you the time to optimize the areas that matter. So, the next time you find yourself writing something at a lower level than you think you should, try to use a library for it. If such a library is not available to you, make an argument for it, and we can all fly a little higher.

A full test harness with all of the tests run is available on github here: https://gist.github.com/20e8b7034fabcf627cab.git

Tuesday 27 May 2014

The performance of unaligned SIMD loads and stores on modern hardware

A few weeks ago, a colleague and I were discussing the performance characteristics of compiler auto-vectorized code. Inevitably, concerns over memory alignment came up, and with them, suspicions that the compiler could not possibly generate optimal code because it would need to assume memory was not aligned. I pointed out that on modern processors, unaligned loads and stores are supposed to perform similarly to the traditional aligned versions. My colleague was less convinced.

So I tested it.

Hypothesis

On modern processors there is little to no penalty for using unaligned loads and stores to SIMD registers.

TL;DR

The details of the tests are below, but if you're the type that likes to skip to the good parts, I will tell you this: generally, you don't need to worry about the alignment of your data when using SIMD on modern processors such as Haswell.

If you're the type that likes a good story, keep reading.

Setup

The test consists of several different implementations of what is essentially a memcpy; a piece of memory is loaded into a SIMD register and then stored to a different address using one of the following load/store pairs;

_mm_load_ps/_mm_store_ps
_mm_loadu_ps/_mm_storeu_ps
_mm256_load_ps/_mm256_store_ps
_mm256_loadu_ps/_mm256_storeu_ps

The _mm_load/store versions come from SSE and operate on 4 floats at a time, requiring 16-byte alignment. The _mm256_load/store versions come from AVX and operate on 8 floats at a time, requiring 32-byte alignment. The versions denoted by 'u' (unaligned) have no alignment requirements.

As a baseline, I also timed a naive for-loop, std::memcpy and std::copy, over the same data.

For fun, I added SSE and AVX implementations that use the so-called 'non-temporal' store instructions, which hint to the processor that the data just stored will not be needed for a while, thereby improving cache utilization.

Each test is implemented as a function which is called from a test harness to time it and output results.  For example, the naive for-loop copy looks like this;
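
In spirit, it's nothing more than this (a sketch with assumed names):

#include <cstddef>

// plain element-by-element copy; the baseline for everything else
void copy_naive(float const* src, float* dst, std::size_t count)
{
    for(std::size_t i = 0; i < count; ++i)
    {
        dst[i] = src[i];
    }
}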

The aligned SSE implementation is as follows;
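
Again in sketch form (assumed names; count is taken to be a multiple of 4 and both pointers 16-byte aligned):

#include <cstddef>
#include <xmmintrin.h>   // SSE intrinsics

void copy_sse_aligned(float const* src, float* dst, std::size_t count)
{
    for(std::size_t i = 0; i < count; i += 4)
    {
        __m128 chunk = _mm_load_ps(src + i);   // aligned 16-byte load
        _mm_store_ps(dst + i, chunk);          // aligned 16-byte store
    }
}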

By feeding each function pointers with alignments ranging from 4 to 64 bytes, we can test the performance characteristics of each. However, it must be noted that the aligned instructions require aligned memory or the program will crash, so those functions are only called with appropriately aligned data.

The full source is available on github.

Experiment and Results

I compiled the test program using Visual Studio 2013 and ran everything on my laptop, which is fitted with a Haswell i7-4700.  The command line I used for compilation was cl.exe /EHsc /Ox /arch:AVX2 simd-copy.cpp. I also experimented with running a version compiled with GCC 4.8.2 and got similar results.

The first test I ran moved 1 billion floats in 256 KB chunks. That is, I allocated 64k floats (256 KB) and copied them from one location to another. This was repeated 16k times for a total of about 1 billion float copies, or 4 GB. This produced the following result;

This was a surprising result to me, so let's go over a few points;
  • std::copy and std::memcpy perform identically. This is to be expected with an optimized std::copy, which inlines to a call to memcpy
  • std::copy and std::memcpy are the fastest. 
  • The naive for-loop copy is the slowest by a large margin
  • The unaligned SSE/AVX calls are about the same and are about 30% slower than std::copy 
  • The aligned SSE calls are the same as the unaligned calls 
  • The aligned AVX calls are about in between SSE and std::copy
  • The non-temporal instructions have essentially no effect
  • std::copy shows a small speedup on 32 byte boundaries
  • Alignment made very little difference in SSE/AVX performance

I ran the test again on a lot more data in an attempt to create a bigger gap. That test looks like this;
  • Overall results are about the same as the previous test
  • Non-temporal performance is now better
  • The difference between the naive copy and the others is now smaller

The results here are essentially the same as the previous test, with alignment making little or no difference, even though we're now processing 16 times more data. However, the non-temporal instructions are now performing really well. Why? The answer is cache. The non-temporal instructions are doing exactly what they're supposed to do: they tell the processor that the data we just wrote doesn't need to be kept in cache, which allows more of the cache to be populated with data prefetched from the source buffer. The naive copy is doing better for a related but slightly different reason. In this case, the program is simply becoming more memory-bound populating the cache from the source buffer. When this happens, it doesn't matter how fast the load from cache to register is, because most of the time is spent waiting for data. The results didn't show up like this in the first test because my processor has 256k of L2 cache per core and up to 6 megs of L3 cache.

So then, what happens if we throw a lot of data at it, like 512 megs?  Well, this;
And what about a small set?
By now a pattern has emerged -- it's clear that the test methods are moving around relative to each other with changing data set size, but at no time does alignment play a role in the performance of loading and storing SSE/AVX data.

Conclusion 

Don't worry about data alignment when writing SIMD code on modern hardware. The tests above adequately demonstrate that, at least on Haswell, alignment isn't going to help you. This is very good news because it frees us up from worrying about these things and allows us to write more features.

In addition, let this be a reminder that if your intuition is telling you where you might be able to find some more speed (say, by aligning all of your SIMD allocations), stop and measure first. If I had not, the alignment work might have actually slowed my program down by making the memory allocator work harder to find a suitable slot.

If anyone else out there has different processors this can be tested on, I encourage you to do so and post the results here so we can learn how the processors differ in performance.