Saturday, 28 March 2015

Gaining Performance Through High Level Abstraction - Part 2

Previously, I discussed how to gain performance by moving to higher level abstractions, rather than dropping to a lower level as is commonly assumed to be necessary.

Xpressing the non-obvious


One thing that bothered me about the original post was that, while the optimized version using the tokenizer was probably fast enough, it took quite a bit more code to get a 35% increase in performance.

Since then, I've found another high level method that, while not as fast as the split method, offers a reasonable speedup with little or no additional code: Boost.Xpressive.

Xpressive is an interesting library in that it implements a domain-specific language within C++, which allows the compiler to generate highly optimized regular expressions.
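For example, here is a minimal sketch of my own (not from the original post) showing the same pattern written both ways; the static form is an ordinary C++ expression that the compiler can see and optimize:

#include <boost/xpressive/xpressive.hpp>
#include <iostream>
#include <string>

int main()
{
    using namespace boost::xpressive;

    // Dynamic form: the pattern is parsed from a string at runtime,
    // much like std::regex.
    sregex dynamic = sregex::compile("\\w+");

    // Static form: the same pattern written as a C++ expression, so the
    // C++ compiler itself gets a chance to optimize it.
    sregex statik = +_w;

    std::string text = "hello";
    std::cout << regex_match(text, dynamic) << ' '
              << regex_match(text, statik) << '\n';  // prints: 1 1
}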

As an experiment, I implemented a version of the original parser using Xpressive.  That code looks like this:

#include <boost/xpressive/xpressive.hpp>
#include <algorithm>
#include <iterator>
#include <set>
#include <string>

void xpressive_csv(std::string raw_csv, std::set<std::string>& dest_set)
{
    using namespace boost::xpressive;

    // Match runs of alphanumeric characters; commas and whitespace are
    // simply skipped by the iterator as non-matching text.
    sregex re = +(+alnum);

    std::transform(
        sregex_iterator(raw_csv.begin(), raw_csv.end(), re),
        sregex_iterator(),
        std::inserter(dest_set, dest_set.begin()),
        [](smatch const& match)
        {
            return match[0];  // the full match, converted to std::string
        }
    );
}
What's nice about this is that it's short, clear, and reasonably fast. Here's the timing:

Chris@chris-hppc MINGW64 ~/src/play/tokenizer
$ time ./a.out xpress
xpressing 10000000 csv values.
xpressed 10000000 values.
real 0m5.979s
user 0m0.000s
sys 0m0.015s

This runs about 18% faster than the optimized tokenizer version with approximately the same amount of code. The 'split' version, shown below as a refresher, is still about 25% faster than this one, but the Xpressive version is considerably less code, which might be worth the trade-off.

#include <boost/algorithm/string/split.hpp>
#include <boost/range/algorithm/transform.hpp>
#include <boost/range/iterator_range.hpp>
#include <cctype>
#include <iterator>
#include <set>
#include <string>
#include <vector>

void split_csv(std::string raw_csv, std::set<std::string>& dest_set)
{
    // Split on commas into iterator ranges over the original string
    // (no copies are made at this stage).
    std::vector<boost::iterator_range<std::string::iterator>> result;
    boost::algorithm::split(result, raw_csv, [](char c)
    {
        return c == ',';
    });

    // Trim whitespace from each range and copy the result into the set.
    boost::transform(result, std::inserter(dest_set, dest_set.begin()),
        [](boost::iterator_range<std::string::iterator> r)
        {
            auto begin = r.begin();
            auto end = r.end();
            while(begin != end && std::isspace(static_cast<unsigned char>(*begin)))
                ++begin;
            while(begin != end && std::isspace(static_cast<unsigned char>(*(end - 1))))
                --end;
            return std::string(begin, end);
        }
    );
    dest_set.erase("");  // drop any empty token left by trailing commas
}
As a further experiment, I implemented an equivalent solution using std::regex:

#include <algorithm>
#include <iterator>
#include <regex>
#include <set>
#include <string>

void regex_csv(std::string raw_csv, std::set<std::string>& dest_set)
{
    // \w+ matches the runs of word characters between commas and spaces.
    std::regex re("\\w+");

    std::transform(
        std::sregex_iterator(raw_csv.begin(), raw_csv.end(), re),
        std::sregex_iterator(),
        std::inserter(dest_set, dest_set.begin()),
        [](std::smatch const& match)
        {
            return match.str();
        }
    );
}

This is virtually identical to the Xpressive version, but slower; its performance is closer to that of the original tokenizer version. That's good news for Xpressive, and means I'll probably turn to that library if I need to do any regular expression work in the future.

Chris@chris-hppc MINGW64 ~/src/play/tokenizer
$ time ./a.out regex
xpressing 10000000 csv values.
xpressed 10000000 values.
real 0m7.647s
user 0m0.000s
sys 0m0.000s
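The full test file is linked in the next section; purely as an illustration of how these parsers might be driven and timed, here is a minimal sketch of my own (not the author's harness), assuming the three functions above are defined in the same file:

#include <chrono>
#include <iostream>
#include <set>
#include <string>

// Declarations for the parsers defined above.
void xpressive_csv(std::string raw_csv, std::set<std::string>& dest_set);
void split_csv(std::string raw_csv, std::set<std::string>& dest_set);
void regex_csv(std::string raw_csv, std::set<std::string>& dest_set);

int main()
{
    // Build a comma-separated list; the element count here is an arbitrary
    // placeholder, not the 10,000,000 values used in the post.
    std::string raw_csv;
    for (int i = 0; i < 100000; ++i)
        raw_csv += "value" + std::to_string(i) + ", ";

    std::set<std::string> dest_set;
    auto start = std::chrono::steady_clock::now();
    xpressive_csv(raw_csv, dest_set);  // swap in split_csv or regex_csv to compare
    auto stop = std::chrono::steady_clock::now();

    std::cout << "parsed " << dest_set.size() << " values in "
              << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()
              << " ms\n";
}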

Are we there yet?


I've updated the token test file with the added parsers, available here.  At this point, you're probably thinking that this isn't optimizable any further without going really raw (like raw pointer raw!).  But that isn't the case; next up is another technique in which high level libraries free us to change things in ways that allow for higher performance.
