In part one and part two of this series, I've talked about how it's possible to increase performance through using high level abstractions because those abstractions usually allow the programmer to implement a feature faster which thereby affords them more time to either optimize the areas that really matter, or iterate on the existing solution until it's 'fast enough'. But there's still another area that using abstractions can allow easy optimization wins.
In the previous posts we were able to achieve a maximum tokenization speed of 4.5 seconds for 10 millions comma separated values. This was using the split algorithm from boost, plus some manual whitespace trimming. The next fastest solution was using Boost.Xpressive to apply a regex tokenize to the string. The code for both versions are available here and here, which shows that the Xpressive solution is considerably shorter than the Split version, though it was 25% slower. As we know, less code usually means fewer bugs, so we really would like to keep the sorter version if possible.
But, what if we really need it to be less than 5 seconds? Assuming one currently has a working version of the Xpressive tokenizer, the first attempt should be to optimize that instead of writing something new. But how? There's only 3 lines of code!
High level data structures and algorithms present an interface to a user for 'doing something'. If these interfaces are well designed and consistent, then it becomes possible to substitute one data structure or algorithm for another when requirements change. Most of the STL is designed this way with consistent push_back, insert, begin/end, etc. It's not perfect but it's certainly very good.
The STL algorithms are something that are still a contentious issue in many real C++ shops. Some people love them, but others still go for the raw loops, completely avoiding <algorithm> altogether. I typically argue that the algorithms are better because they're self documenting and less likely to cause bugs, but it seems that this isn't a super strong argument as it's met with resistance because many programmers are not accustomed to the algorithms and are more easily able to recognize the common patterns in raw loop code. In understand this but believe that a professional programmer needs both to be able to read and write both.
However, the new parallel STL proposal is an example of where using these high level algorithms can gain you performance. If you need to parallelize some code quickly and that code was written with std:: algorithms, then they can be trivially parallelized by passing std::par to the algorithms. This is not a silver bullet if you want to maximize hardware utilization, but it's an easy way to gain performance with little effort; all because of the interface.
So what does this have to do with our parser? We're not going to make it faster via parallelism, but we are going to take advantage of a common interface.
One thing that each parser implementation is doing is making a copy of the sub string into the std::set. It would be nice if we could remove that.
Enter boost::string_ref
boost::string_ref is the basis for the proposed std::string_view, which implements a non-owning reference to a string. What's nice about this is the string_ref can be used to represent a slice of the owning string. To use it, we change the interfaces of the parser functions to use string_ref instead of std::string. Here's a modified xpressive_csv;
We've changed the function signature for both the input and the resulting std::set. This may not be possible in all situations because we are now forcing the caller to hold onto a copy of the original string and changed the type we're putting the results into. But because string_ref has a very similar interface to std::string, this will be possible in many situations with little or no changes for the user. We've also had to modify the regex types to use the iterator type from string_ref because smatch, etc are really just typedefs of the base template for std::string iterators. There are also typedefs for char* (cmatch, et al), which we could use but it seemed better to use the string_ref iterators.
Running this code is about 30% faster than before and about 10% faster than the original split method. Split is now also 30% faster than and running in a mere 3 seconds. Here's that code;
And the corresponding timings for both;
The next logical step from here is to replace the std::set with something a little better. My first thought was to use something like boost::container::flat_set, which is essentially a sorted array with a std::set interface. Using this container is a straight forward transformation, but doing so results in extremely slow code.
This actually takes so long to run that I had to cancel it. The problem is that flat_set insertion is very slow because, being a sorted array, inserting out of order is an O(N^2) operation. The solution is to accumulate the results into a separate list, sort that and then insert into flat_set in order. Both boost flat_set and flap_map expose separate functions for doing this precise thing; inserting a list of sorted unique items.
However, running this code shows that it's actually slower. This is where constant time factors come in. Both solutions are N*log(N), but the set insertion starts with a smaller number of elements in it so it's actually log(1) + log(2) ... log(N), which converges on N log(N). std::sort is working from a full N right from the get go and that's a real difference. There are still potential benefits to using this technique though. One was mentioned earlier in regard to the parallel algorithm; we could invoke a parallel sort on the result and speed that up significantly. Another is related to how the data is consumed post parse; if this set is essentially immutable after the parse, it might be better to sacrifice the parse speed for better performance when consuming the data; something that flat_set can make a very real difference with because it utilizes the cache much more effectively. But that discussion is for another time.
A quick refresher
In the previous posts we were able to achieve a maximum tokenization speed of 4.5 seconds for 10 millions comma separated values. This was using the split algorithm from boost, plus some manual whitespace trimming. The next fastest solution was using Boost.Xpressive to apply a regex tokenize to the string. The code for both versions are available here and here, which shows that the Xpressive solution is considerably shorter than the Split version, though it was 25% slower. As we know, less code usually means fewer bugs, so we really would like to keep the sorter version if possible.
But, what if we really need it to be less than 5 seconds? Assuming one currently has a working version of the Xpressive tokenizer, the first attempt should be to optimize that instead of writing something new. But how? There's only 3 lines of code!
Hiding behind your interface
High level data structures and algorithms present an interface to a user for 'doing something'. If these interfaces are well designed and consistent, then it becomes possible to substitute one data structure or algorithm for another when requirements change. Most of the STL is designed this way with consistent push_back, insert, begin/end, etc. It's not perfect but it's certainly very good.
The STL algorithms are something that are still a contentious issue in many real C++ shops. Some people love them, but others still go for the raw loops, completely avoiding <algorithm> altogether. I typically argue that the algorithms are better because they're self documenting and less likely to cause bugs, but it seems that this isn't a super strong argument as it's met with resistance because many programmers are not accustomed to the algorithms and are more easily able to recognize the common patterns in raw loop code. In understand this but believe that a professional programmer needs both to be able to read and write both.
However, the new parallel STL proposal is an example of where using these high level algorithms can gain you performance. If you need to parallelize some code quickly and that code was written with std:: algorithms, then they can be trivially parallelized by passing std::par to the algorithms. This is not a silver bullet if you want to maximize hardware utilization, but it's an easy way to gain performance with little effort; all because of the interface.
So what does this have to do with our parser? We're not going to make it faster via parallelism, but we are going to take advantage of a common interface.
One thing that each parser implementation is doing is making a copy of the sub string into the std::set. It would be nice if we could remove that.
Enter boost::string_ref
boost::string_ref is the basis for the proposed std::string_view, which implements a non-owning reference to a string. What's nice about this is the string_ref can be used to represent a slice of the owning string. To use it, we change the interfaces of the parser functions to use string_ref instead of std::string. Here's a modified xpressive_csv;
We've changed the function signature for both the input and the resulting std::set. This may not be possible in all situations because we are now forcing the caller to hold onto a copy of the original string and changed the type we're putting the results into. But because string_ref has a very similar interface to std::string, this will be possible in many situations with little or no changes for the user. We've also had to modify the regex types to use the iterator type from string_ref because smatch, etc are really just typedefs of the base template for std::string iterators. There are also typedefs for char* (cmatch, et al), which we could use but it seemed better to use the string_ref iterators.
Running this code is about 30% faster than before and about 10% faster than the original split method. Split is now also 30% faster than and running in a mere 3 seconds. Here's that code;
And the corresponding timings for both;
What about the set?
The next logical step from here is to replace the std::set with something a little better. My first thought was to use something like boost::container::flat_set, which is essentially a sorted array with a std::set interface. Using this container is a straight forward transformation, but doing so results in extremely slow code.
This actually takes so long to run that I had to cancel it. The problem is that flat_set insertion is very slow because, being a sorted array, inserting out of order is an O(N^2) operation. The solution is to accumulate the results into a separate list, sort that and then insert into flat_set in order. Both boost flat_set and flap_map expose separate functions for doing this precise thing; inserting a list of sorted unique items.
However, running this code shows that it's actually slower. This is where constant time factors come in. Both solutions are N*log(N), but the set insertion starts with a smaller number of elements in it so it's actually log(1) + log(2) ... log(N), which converges on N log(N). std::sort is working from a full N right from the get go and that's a real difference. There are still potential benefits to using this technique though. One was mentioned earlier in regard to the parallel algorithm; we could invoke a parallel sort on the result and speed that up significantly. Another is related to how the data is consumed post parse; if this set is essentially immutable after the parse, it might be better to sacrifice the parse speed for better performance when consuming the data; something that flat_set can make a very real difference with because it utilizes the cache much more effectively. But that discussion is for another time.