Sunday 1 November 2015

Beware of codegen differences between std::for_each and range-based-for in VS2015

std::for_each != for(auto i : a)


Switching between range-based-for and std::for_each should be transparent and seamless. But it turns out that in VS2015, these two forms of iteration do not always generate equivalent code. Unfortunately, it's not the case that one form is always better than the other, and even more disturbing is that the performance difference in my tests can be as large as 15x. Curious? Read on.

Background


Recently, I've been working on an abstraction that utilizes so-called 'fat iterators'. These are smart iterators that adapt some concept into a range. boost::zip_iterator is a good example: it joins two iterators together so that an operation can be run on both of them through a single representation of the pair. Anyone who has ever wanted to sort one range in terms of the values in another knows what this is good for.

Example: Say we have two std::vectors, one with names and one with corresponding ids. In the current standard there's no way to sort those together; instead, one needs to copy the data to a struct and sort that. That sucks, so enter boost::zip_iterator:

std::sort(
    boost::make_zip_iterator(boost::make_tuple(names.begin(), ids.begin())),
    boost::make_zip_iterator(boost::make_tuple(names.end(), ids.end()))
);

Perfect. We've sorted alphabetically and the ids came along with them. This is great and, hypothetically, it's a free abstraction performance-wise.

Update: Nov 7th, 2015 - Since the original publication, it was pointed out to me that this example doesn't actually work. It will compile on C++03 compilers but gives the wrong result, because the zip_iterator returns values to the user via a tuple of references to the underlying elements. Despite that, I think you get the idea of what fat iterators are all about.
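
For comparison, here is a minimal sketch of the copy-to-a-struct approach mentioned above (the struct and its fields are just placeholders, not code from my library):

struct NameId
{
    std::string name;
    int         id;
};

std::vector<NameId> combined;
combined.reserve(names.size());
for(std::size_t i = 0; i < names.size(); ++i)
{
    combined.push_back(NameId{ names[i], ids[i] });
}

std::sort(combined.begin(), combined.end(),
    [](NameId const& a, NameId const& b)
    {
        return a.name < b.name;   // sort alphabetically; the ids travel along for free
    }
);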

The library I am working on uses a similar technique for a different abstraction, but it should also be, hypothetically, zero overhead, so I have a large suite of performance benchmarks that run as part of my tests.

Problem 1: Range-based-for is sometimes slower than std::for_each


The other day, I added a test to make sure my iterators worked with the C++11 range-based-for. This was a cut-and-paste of a test that used std::for_each, adapted to the newer syntax. What I saw was a 20% reduction in performance.

20%!

I checked the code to make sure I hadn't made a mistake, but found none. To be sure, I compiled the tests with gcc 5.2 and found no performance difference between std::for_each and range-based-for, so this was definitely something specific to VS2015.

I dumped the assembly of the loops in question to see what was going on. The std::for_each emitted this code:

movss       xmm0,dword ptr [rcx]  
addss       xmm0,xmm6  
movss       dword ptr [rcx],xmm0  
add         rcx,4  
cmp         rcx,rax  
jne         

movss       xmm0,dword ptr [rdx+rcx]  
mulss       xmm0,xmm1  
addss       xmm0,dword ptr [rcx]  
movss       dword ptr [rcx],xmm0  
add         rcx,4  
sub         r9,1  
jne         

movss       xmm0,dword ptr [rdx+rcx]  
mulss       xmm0,xmm1  
addss       xmm0,dword ptr [rcx]  
movss       dword ptr [rcx],xmm0  
add         rcx,4  
sub         r9,1  
jne         

Each block here represents the looping section of a std::for_each call that is performing some floating point adds and multiplies. I've removed the addresses for clarity, so I'll point out that the 'jne' at the end of each block sends us back to the top of that block until the loop is complete. Nothing too special here and this code looks reasonable.
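
For context, here is roughly what each block computes. This is my own reconstruction from the assembly, not the actual test code (which goes through the fat iterators); a, b, c and s are placeholder names:

#include <cstddef>
#include <vector>

void blocks(std::vector<float>& a, std::vector<float> const& b, float c, float s)
{
    // block 1: movss / addss / movss with a pointer bump, compare and jne
    for(std::size_t i = 0; i < a.size(); ++i)
        a[i] += c;

    // blocks 2 and 3: load b[i], mulss by s, addss into a[i], store back
    for(std::size_t i = 0; i < a.size(); ++i)
        a[i] += b[i] * s;
}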

Here's the corresponding range-based-for assembly:

mov         rax,qword ptr [rbp+200h]  
movss       xmm0,dword ptr [rax+rcx*4]  
addss       xmm0,xmm2  
movss       dword ptr [rax+rcx*4],xmm0  
inc         rcx  
cmp         rcx,rdx  
jne         

mov         rdx,qword ptr [rbp+250h]  
mov         rcx,qword ptr [rbp+200h]  
movss       xmm0,dword ptr [rcx+rax*4]  
mulss       xmm0,xmm1  
addss       xmm0,dword ptr [rdx+rax*4]  
movss       dword ptr [rdx+rax*4],xmm0  
inc         rax  
cmp         rax,r8  
jne         

mov         rdx,qword ptr [rbp+2A0h]  
mov         rcx,qword ptr [rbp+250h]  
movss       xmm0,dword ptr [rcx+rax*4]  
mulss       xmm0,xmm1  
addss       xmm0,dword ptr [rdx+rax*4]  
movss       dword ptr [rdx+rax*4],xmm0  
inc         rax  
cmp         rax,r8  
jne     

Again, we have a representation of three loops, but obviously there is more code. Upon closer inspection though, we see that the code is mostly the same. Here are the differences, marked below:

mov         rax,qword ptr [rbp+200h]    ; <-- extra
movss       xmm0,dword ptr [rax+rcx*4]  
addss       xmm0,xmm2  
movss       dword ptr [rax+rcx*4],xmm0  
inc         rcx  
cmp         rcx,rdx  
jne         

mov         rdx,qword ptr [rbp+250h]    ; <-- extra
mov         rcx,qword ptr [rbp+200h]    ; <-- extra
movss       xmm0,dword ptr [rcx+rax*4]  
mulss       xmm0,xmm1  
addss       xmm0,dword ptr [rdx+rax*4]  
movss       dword ptr [rdx+rax*4],xmm0  
inc         rax  
cmp         rax,r8  
jne         

mov         rdx,qword ptr [rbp+2A0h]    ; <-- extra
mov         rcx,qword ptr [rbp+250h]    ; <-- extra
movss       xmm0,dword ptr [rcx+rax*4]  
mulss       xmm0,xmm1  
addss       xmm0,dword ptr [rdx+rax*4]  
movss       dword ptr [rdx+rax*4],xmm0  
inc         rax  
cmp         rax,r8  
jne     

Five additional mov instructions; the rest is the same. The interesting thing is that those mov instructions don't accomplish anything, because $rbp doesn't change at all during the loop. Take a look, I'll wait.

To me this implies a bug in the compiler where the address calculated for the jne is wrong: though the moves are initially required to set up $rdx and $rcx, after that everything is static. What should actually happen is that the jne should jump to the first movss instruction within each block rather than re-running those initial moves on each iteration. To confirm this, I patched the executable to do just that, and indeed I got the correct result and the performance problem was gone.

I've been trying for a few days now to come up with a distilled test that reveals the problem, but have been unable to do so.  However, during that process, I stumbled upon...

Problem 2: std::for_each is sometimes slower than range-based-for


While trying to isolate the original problem, I wrote the following program:

// compile with: cl.exe /EHsc /O2
#include <iostream>
#include <vector>
#include <algorithm>
#include <sstream>

#if RANGE_FOR
  char const* type = "Range: ";
#elif FOREACH_LAMBDA
  char const* type = "ForEach L: ";
#else
#  error "Must Define one of RANGE_FOR or FOREACH_LAMBDA"
#endif

typedef float value_type;

int main(int argc, char** argv)
{
    int num_values = 5 * 1000 * 1000;
    if(argc > 1)
    {
        std::stringstream param;
        param.str(argv[1]);
        param >> num_values;
        if(param.fail())
        {
            std::cout << "Failed to parse " 
                      << argv[1] << " as int.";
            return 1;
        }
    }
    int num_iterations = 2000;
    if(argc > 2)
    {
        std::stringstream param;
        param.str(argv[2]);
        param >> num_iterations;
        if(param.fail())
        {
            std::cout << "Failed to parse " 
                      << argv[2] << " as int.";
            return 1;
        }
    }
    
    std::vector<value_type> values(num_values, 1);

    for(int i = 0; i < num_iterations; ++i)
    {
    #if RANGE_FOR
        for(value_type& v : values)
        {
            v += 1;
        };
    #elif FOREACH_LAMBDA
        std::for_each(values.begin(), values.end(), 
            [](value_type& v)
            {
                v += 1;
            }
        );
    #endif
    }

    std::cout << type << values.front() << "\n";
    
    return 0;
}

This is about as distilled as I could make it, and to my delight, each case generated identical code. So, to try to make them generate different code, I added another variable:

    typedef float total_type;
    total_type total = 0;
    for(int i = 0; i < num_iterations; ++i)
    {
    #if RANGE_FOR
        for(value_type& v : values)
        {
            v += 1;
            total += v;
        };
    #else
        std::for_each(values.begin(), values.end(), 
            [&total](value_type& v)
            {
                v += 1;
                total += v;
            }
        );
    #endif
    }

This time they did generate different code, but it was std::for_each that was slower; an astounding 3x slower (9 seconds for range-based-for vs 28 seconds for std::for_each).

Here's the generated assembly from the std::for_each:

movss       xmm1,dword ptr [rax]  
addss       xmm1,xmm6  
movss       dword ptr [rax],xmm1  
movss       xmm0,dword ptr [total]    ; <-- extra
addss       xmm0,xmm1  
movss       dword ptr [total],xmm0    ; <-- extra
add         rax,4  
inc         rcx  
cmp         rcx,rdx  
jne         main+510h  

The two marked lines are the additional instructions that are absent from the range-based-for implementation, and, like the first example, they're both completely unnecessary. If you can't tell, that group of three lines (the two marked plus the one in between) is doing this line from the source code:

total += v;

And a description of the generated assembly:

movss       xmm0,dword ptr [total]  # Load 'total' from memory
addss       xmm0,xmm1               # Add 'v' to 'total'
movss       dword ptr [total],xmm0  # Store 'total' back to memory

The load and the store here are completely unnecessary, because 'total' could simply be kept in a register for the duration of the loop, as it is in the range-based-for version. Bizarre.

At first, I thought that this was somehow a floating-point problem, with the compiler getting confused about whether it could keep the value in a register or not. So, I switched to ints.
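
Concretely, the only change is to the two typedefs from the listings above:

typedef int value_type;
typedef int total_type;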

Now std::for_each was an astounding 15x slower (1.4 seconds for range-based-for vs 23 seconds for std::for_each). Huh? How is that even possible?

Well, let's look at some assembly again. Here's the std::for_each + lambda:

inc         dword ptr [rax]  
mov         ecx,dword ptr [rax]  
add         dword ptr [total],ecx  
lea         rax,[rax+4]  
inc         rdx  
cmp         rdx,r8  
jne         main+500h

That's not too bad, other than the extra work of storing the total back to memory on each iteration.

How does the range-based-for achieve a 15x speedup on this?

movdqu      xmm0,xmmword ptr [rax]  
paddd       xmm0,xmm2  
movdqu      xmmword ptr [rax],xmm0  
paddd       xmm0,xmm1  
movdqa      xmm1,xmm0  
movdqu      xmm0,xmmword ptr [rax+10h]  
paddd       xmm0,xmm2  
movdqu      xmmword ptr [rax+10h],xmm0  
paddd       xmm0,xmm3  
movdqa      xmm3,xmm0  
add         rax,20h  
add         rcx,8  
cmp         rcx,r9  
jne         main+520h  
cmp         rcx,r8  
je          main+570h

Ah, in this case the compiler realized that it could vectorize the loop and add up four ints at a time. Cool, but why didn't it do this for floats, you might ask? The answer is that floating-point addition is not associative. The way the vectorization works is by adding up the numbers in separate SIMD lanes and then adding the lanes together at the end. Because this changes the order of the operations, and for floats such a change would slightly change the result, the compiler cannot do it.
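
To make the reordering concrete, here's a scalar sketch of what a 4-lane vectorized accumulation effectively computes (illustrative only, not the compiler's actual code):

#include <cstddef>
#include <vector>

float lane_style_sum(std::vector<float> const& v)
{
    float lane[4] = { 0, 0, 0, 0 };     // one partial sum per SIMD lane
    std::size_t i = 0;
    for(; i + 4 <= v.size(); i += 4)
    {
        lane[0] += v[i + 0];
        lane[1] += v[i + 1];
        lane[2] += v[i + 2];
        lane[3] += v[i + 3];
    }
    // lanes are only combined at the end...
    float total = (lane[0] + lane[1]) + (lane[2] + lane[3]);
    for(; i < v.size(); ++i)            // ...plus any leftover elements
        total += v[i];
    // ...so the rounding differs from a strict left-to-right sum
    return total;
}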

For fun, I told the compiler to compile the integer program with AVX2 support, which allows up to 8 integer operations per instruction, and the run time dropped to 0.6 seconds, which represents a roughly 40x speed improvement over the std::for_each version.
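
For reference, the MSVC switch that enables this is /arch:AVX2, so the range-based-for build line becomes something like (the file name is just a placeholder):

cl /EHsc /O2 /arch:AVX2 /DRANGE_FOR=1 bench.cpp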

Ouch.

Anyway, back to the actual problem. Because this test case was much smaller than the one utilizing my fat iterators, I was able to find the culprit: it's the capture of 'total' into the lambda. If I convert the code from the lambda syntax to use a struct (i.e., the C++03 way), then I get optimal (though not auto-vectorized) code again.
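
For reference, here is a sketch of the functor equivalent (roughly what the /DFOREACH_FUNCTOR=1 build in the full harness does; the struct name is just illustrative):

struct AddAndAccumulate
{
    explicit AddAndAccumulate(total_type& t) : total(t) {}

    void operator()(value_type& v) const
    {
        v += 1;
        total += v;   // 'total' is a plain reference member instead of a lambda capture
    }

    total_type& total;
};

// ...
std::for_each(values.begin(), values.end(), AddAndAccumulate(total));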

Huh, bummer. I really like lambdas.

To be thorough, I checked gcc 5.2 (mingw-w64) to see what it generated, and in all of my test cases, including the convoluted fat-iterator tests, I got identical code for both range-based-for and std::for_each. Here's the gcc assembly for the floating-point tests:

movss  (%rax),%xmm0
add    $0x4,%rax
addss  %xmm1,%xmm0
movss  %xmm0,-0x4(%rax)
cmp    %rax,%rdx
addss  %xmm0,%xmm6
jne    

Aside from the gdb disassembler syntax and some instruction ordering, this version is the same as the optimal range-based-for code from VC++ 2015. Excellent!

Before we praise gcc too much though, it should be noted that at no time did gcc auto-vectorize any of the code, so there's a point for VC++ on that front.

One other test I did was calling the lambda from the range-based-for, just to make sure the issue wasn't somehow std::for_each related, and sure enough, once again the lambda capture variable was reloaded and stored on each iteration.
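
That variant looks roughly like this (my own sketch, not the exact test code):

auto op = [&total](value_type& v)
{
    v += 1;
    total += v;
};

for(value_type& v : values)
{
    op(v);   // same body, but invoked through the capturing lambda
}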

This indicates that the problem is not really related to the different looping constructs and is actually a problem with how VC++ handles lambda captures. However, because std::for_each encourages the use of lambdas, it seems pretty relevant.

Wrapping Up


Sadly, I couldn't come up with a test case where VC++ generated bad code for the range-based-for, other than my convoluted fat iterators. I even experimented with some of the boost fat iterators and was unable to reproduce the problem. I'm hoping someone can shed some light on that so we can come up with a workaround. I plan to check the results on VS2015 Update 1, which should be out shortly, to see if the problem is fixed. If not, I'll be filing a bug report with Microsoft pointing them to my test suite, which will hopefully be releasable at that point. I also plan to test the lambda issue just discussed. If that also fails, then there will be two bug reports.

Update (Dec 7th, 2015): I tested everything on VS2015 Update 1 and found that all of the issues still occur. A bug has been logged with Microsoft so I expect it will be fixed.

Until then, consider this a PSA to always measure your code. If you have some performance-sensitive code, then don't just blindly switch between range-based-for and std::for_each; make sure you measure and look at the generated code.

The full source code for the test harness is available below. Compile with cl /EHsc /O2. Add /DRANGE_FOR=1 to test range-based-for, /DFOREACH_LAMBDA=1 to use the lambda syntax, or /DFOREACH_FUNCTOR=1 to test the functor equivalent of the lambda. The test results you get may differ from mine, so for reference this was run on my laptop, which is equipped with a quad-core Haswell i7-4700.

Update (Nov 2nd, 2015): I tested everything above on VS2012 and found that none of the issues occur there, so this is looking very much like a problem isolated to VS2015.


2 comments:

  1. You should *totally* file a Microsoft Connect bug in the mean time, even if you have incomplete info, just so that it's on their radar.

  2. I have done just that. It's possible that the issues don't occur on Update 1, so I will retest on that as well and we'll see. Thanks for reading.
