Saturday 28 March 2015

Gaining Performance Through High Level Abstraction - Part 2

Previously, I discussed how to gain performance by moving to higher level abstractions instead of going lower level, as common wisdom suggests.

Xpressing the non-obvious


One thing that bothered me about the original post was that, while the optimized version using the tokenizer was probably fast enough, it took quite a bit more code to get a 35% increase in performance.

Since then, I found another high level method that, while not as fast as the split method, offers a reasonable speed-up with little or no additional code. That method is Boost.Xpressive.

Xpressive is an interesting library in that it implements a domain specific language within C++ which allows for the generation of very optimized regular expressions.

As an experiment, I implemented a version of the original parser using Xpressive. That code looks like this:

What's nice about this is that it's short, clear, and reasonably fast. Here's the timing:


This runs about 18% faster than the optimized tokenize version with approximately the same amount of code. The 'split' version, shown below as a refresher, is still about 25% faster than this one, but this is considerably less code, which might be worth it.

As a further experiment, I implemented an identical solution using std::regex:
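Again, the listing is lost; a sketch of the equivalent (parse_std_regex is a placeholder name) would express the same token pattern as a classic regex string and run it through the standard library's engine:

```cpp
#include <regex>
#include <string>
#include <vector>

// Sketch of the std::regex equivalent: the same "non-comma,
// non-whitespace run" token pattern, run through the standard engine.
std::vector<std::string> parse_std_regex(const std::string& input)
{
    static const std::regex token("[^,\\s]+");

    std::vector<std::string> result;
    for (std::sregex_iterator it(input.begin(), input.end(), token), end;
         it != end; ++it)
    {
        result.push_back(it->str());
    }
    return result;
}
```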


This is virtually identical to the Xpressive version, but slower; its performance is closer to the original tokenizer version's. That's good news for Xpressive, and it means I'll probably turn to that library if I need to do any regular expression work in the future.


Are we there yet?


I've updated the token test file with the added parsers, available here. At this point, you're probably thinking that this isn't optimizable any further without going really raw (like raw pointer raw!). But that isn't the case; next up is another technique in which high level libraries free us up to change things in order to allow for higher performance.

Sunday 22 March 2015

Gaining Performance Through High Level Abstraction - Part 1

In my career, C++ has been used for about 90% of everything. Personally, C++ accounts for somewhere around 99% of the code I write at work. There are various reasons for choosing C++ for the projects I work on, but the two main ones are performance and cross platform compatibility.

Applications of this size, however, invariably have both their low level, performance critical sections and their high level sections where performance is a non-issue. Here we find a problem with C++: by default, it doesn't come with enough high level libraries to allow programmers to work on high level features in an efficient way, and in the end, that hurts the performance of both the programmers and the end product. We need access to higher level libraries by default, and hopefully by the end of this, you will agree.

All about the libs, man!


One of the great strengths of C++, and possibly its greatest strength, is the ability to write both low level code and, through libraries, very high level code. The high level code can indeed get very high level, as the language is malleable enough to allow one to effectively alter it through libraries like Boost Spirit or the new range proposal. Of course, other languages also have libraries, and I argue that the strength of the likes of Python, Java and C# is their libraries, not the languages themselves.

However, on big C++ projects, I often find that the use of anything outside of the C++ standard library is not encouraged, or even actively discouraged, usually due to performance concerns. Even the standard library is banned on some projects because the performance gurus don't want a stitch of software in the way of optimal performance.

The problem with this thinking is that it actually hurts performance because, in reality, optimal performance is an illusion. On any software project, there's only so much time to make the product, and that time needs to be split between building features and optimizing. This is where the famous 'premature optimization' quote comes in: any time you spend optimizing is time you don't have to build features, so if you optimize something that's not performance critical, you've wasted time.

I firmly believe this through my own experiences, so I always prioritize building the feature first and optimizing later, because I don't want to spend any time optimizing something that may get cut, or simply may not be important enough to optimize. But to do this well, one needs access to high level algorithms and libraries, because in my experience, once one gets the hang of programming with algorithms, hand rolling each piece of code takes much longer than just using the algorithms. Algorithm centric code also generally contains fewer bugs and is easier to read, because the reader doesn't need to decode each loop and conditional. The time savings apply project wide, which eventually equates to even more time at the end of the project to optimize. It's a cumulative advantage.

Highs and Lows


Recently, I hit a bug in a piece of production code that was programmed at a pretty low level. The function was parsing a string, and the bug was a crash when a string containing only whitespace was passed to the function. To fix the bug, I had to:

- Determine what the code was trying to do (decode for-loops and conditionals)
- Determine how to fix the current code
- Apply and test the fix with various inputs.

Except, when I figured out what the code was doing, I realized I could replace all of it with an existing library, which would reduce the code to a fraction of its original size while also fixing the bug. The original code looked something like this:
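The actual production listing isn't reproduced here; the sketch below is an illustrative hand-rolled parser in the same style (without the specific whitespace-only crash), just to convey how much loop-and-conditional decoding this kind of code demands:

```cpp
#include <cctype>
#include <string>
#include <vector>

// Illustrative hand-rolled parser: find each comma, then walk a pair
// of indices inward to strip the whitespace around each chunk.
std::vector<std::string> parse_manual(const std::string& input)
{
    std::vector<std::string> result;
    std::size_t pos = 0;
    while (pos < input.size())
    {
        std::size_t comma = input.find(',', pos);
        if (comma == std::string::npos)
            comma = input.size();

        std::size_t begin = pos;
        while (begin < comma &&
               std::isspace(static_cast<unsigned char>(input[begin])))
            ++begin;

        std::size_t end = comma;
        while (end > begin &&
               std::isspace(static_cast<unsigned char>(input[end - 1])))
            --end;

        if (begin < end)
            result.push_back(input.substr(begin, end - begin));

        pos = comma + 1;
    }
    return result;
}
```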

What this is essentially doing is parsing a comma separated list of strings and stripping whitespace in the process. I replaced the whole thing with this:

Here, we're using a high level abstraction from boost that does all of the tokenization logic for us. Less code, no bugs, much better.

This is where the performance gurus come in and ask about the performance differences. On my machine, I get the following numbers parsing a string with 10 million commas:

This shows that the original version was about two times faster, but the bug is fixed and the code is smaller and easier to read, so as long as this particular code is not in a performance critical section, I would call it done despite it being slower. I would then use my saved time to optimize something that matters. In this context, that feels like a cop out though, so let's see if we can optimize this. Considering it took no time to write this simple implementation, we can afford a few cycles to optimize it.

First, reading the docs, it seems the tokenizer is doing more work than the original code, including looking for quotes and doing escape analysis. Plus, we have to manually remove the whitespace post tokenize, which is a bit of a bummer. Luckily, tokenizer exposes the ability to customize the separator in a way that makes it work more closely to what the original code was doing. So, let's change the code as follows to give that a go:

Here we've changed from using the escaped_list_separator to the simpler char_separator. Running this shows a significant improvement; in fact, we're already beating the original implementation on performance. I'm pretty sure any reasonable programmer would still be typing in the original version, let alone debugging or timing it, and this only took a few minutes.

Suppose however, that this code did need to be optimized. The side effect of using high level libraries is that the code is much easier to understand, because in this case, what it's doing is plain as day; it's tokenizing a string based on ',' and whitespace. This means it's easier for someone to write a more optimal version should it actually be necessary.

Looking at what the tokenizer does, it's easy to see that the intent of this code is to split the string and trim whitespace, so a good first attempt to optimize is to do exactly that: we'll remove the tokenizer and use a slightly lower level library function, boost::split. This function takes the string, splits it, and inserts the chunks into a container for you.

Not bad. Still pretty high level; the only extra code we need is to remove the whitespace manually, because split doesn't know anything about that. However, there's a bug. Do you see it? Take a look, I'll wait.

If you found it, good for you. If not, don't worry, because it's a classic: std::isspace(*end) is off-by-one. It should be std::isspace(*(end-1)). By removing one level of abstraction, we already introduced one bug, just by trying to optimize. Here's the corrected version:

That being said, we did fare better: the new implementation is about 40% faster than the original and 35% faster than the optimized tokenizer version.

Flying High


So, at this point, we have been able to write a quick and dirty tokenizer parser and go through two iterations of optimization before approaching the complexity of the original version. It's probably possible to come up with a more optimal solution should it be necessary, but it might not be. This is how high level libraries make your products faster, by allowing you to write features quickly and read the code easily, giving you the time to optimize the areas that matter. So, the next time you find yourself writing something at a lower level than you think you should, try to use a library for it. If such a library is not available to you, make an argument for it, and we can all fly a little higher.

A full test harness with all of the tests run is available on github here: https://gist.github.com/20e8b7034fabcf627cab.git