Xpressing the non-obvious
Previously, I discussed how to gain performance by moving to higher-level abstractions instead of going lower level, as is commonly perceived. One thing that bothered me about that post was that, while the optimized version using the tokenizer was probably fast enough, it took quite a bit more code to get a 35% increase in performance.
Since then, I've found another high-level method that, while not as fast as the split method, offers a reasonable speedup with little or no additional code: Boost.Xpressive.
Xpressive is an interesting library in that it implements a domain-specific language within C++, which allows the compiler to generate highly optimized regular expressions.
As an experiment, I implemented a version of the original parser using Xpressive. The code looks like this:
void xpressive_csv(std::string raw_csv, std::set<std::string>& dest_set)
{
    using namespace boost::xpressive;

    // Match runs of alphanumeric characters.
    sregex re = +alnum;
    std::transform(
        sregex_iterator(raw_csv.begin(), raw_csv.end(), re),
        sregex_iterator(),
        std::inserter(dest_set, dest_set.begin()),
        [](smatch const& match)
        {
            return match.str();
        });
}
What's nice about this is that it's short, clear, and reasonably fast. Here's the timing:
Chris@chris-hppc MINGW64 ~/src/play/tokenizer
$ time ./a.out xpress
xpressing 10000000 csv values.
xpressed 10000000 values.

real    0m5.979s
user    0m0.000s
sys     0m0.015s
This runs about 18% faster than the optimized tokenize version with approximately the same amount of code. The 'split' version, shown below as a refresher, is still about 25% faster, but the Xpressive version is considerably less code, which may be worth the trade-off.
void split_csv(std::string raw_csv, std::set<std::string>& dest_set)
{
    std::vector<boost::iterator_range<std::string::iterator>> result;
    boost::algorithm::split(result, raw_csv, [](char c)
    {
        return c == ',';
    });
    boost::transform(result, std::inserter(dest_set, dest_set.begin()),
        [](boost::iterator_range<std::string::iterator> r)
        {
            auto begin = r.begin();
            auto end = r.end();
            // Trim leading and trailing whitespace. Note *(end - 1):
            // 'end' is one past the last character, so dereferencing
            // it directly would test the wrong byte. The unsigned char
            // cast keeps std::isspace defined for high-bit characters.
            while(begin != end && std::isspace(static_cast<unsigned char>(*begin)))
                ++begin;
            while(begin != end && std::isspace(static_cast<unsigned char>(*(end - 1))))
                --end;
            return std::string(begin, end);
        });
    dest_set.erase("");
}
As a further experiment, I implemented an identical solution using std::regex:
void regex_csv(std::string raw_csv, std::set<std::string>& dest_set)
{
    std::regex re("\\w+");
    std::transform(
        std::sregex_iterator(raw_csv.begin(), raw_csv.end(), re),
        std::sregex_iterator(),
        std::inserter(dest_set, dest_set.begin()),
        [](std::smatch const& match)
        {
            return match.str();
        });
}
This is virtually identical to the Xpressive version, but slower; performance is closer to the original tokenizer version, which is good news for Xpressive and means I'll probably turn to that library if I need to do any regular expression work in the future.
Chris@chris-hppc MINGW64 ~/src/play/tokenizer
$ time ./a.out regex
xpressing 10000000 csv values.
xpressed 10000000 values.

real    0m7.647s
user    0m0.000s
sys     0m0.000s
Are we there yet?
I've updated the token test file with the added parsers, available here. At this point, you're probably thinking that this isn't optimizable any further without going really raw (like raw-pointer raw!). But that isn't the case: next up is another technique in which high-level libraries free us to change things in ways that allow for higher performance.