5: Benchmarking Generators (& Conclusion)
Coroutines: Basics, Py->C++20,
Benchmarking
- Introduction
- What are coroutines?
- What coroutines can look like in C++
- Benchmark generators: C++03, C++20 & Python
This is a mini-series leading up to implementing coroutines in C++20, (“what I learnt on my holiday”). It also has comparison with other implementations, as part of my ModernCPP explorations. (first post)
So far in this series we’ve covered what coroutines are, what (python PEP255 style) simple generators are, how they can look in C++, looked at C++20 and C++03 implementations. The question arises - what is the performance?
Coroutines and generators can be used for things a lot more interesting than fibonacci sequence generation. However it it does form a useful basis for benchmarking the cost of:
- Periodic creation
- Yielding values back
… at least when comparing like with like. Primarily it gives you a “cost” for context switching - since you switch contexts when a value is yielded and consumed.
Given this I’ve done some very simplistic benchmarking, specifically:
- To time how long it takes to run a function a billion times that creates a generator object extracted the first 22 values from it.
The harness for each of implementation follows.
The C++03 harness:
#include "cpp03generators.hpp"
#include <iostream>
class Fibonnaci: public Generator {
int a, b, s;
public:
() : a(1), b(1), s(0) { }
Fibonnaciint next() {
GENERATOR_CODE_STARTwhile ( true ) {
(a);
YIELD= a + b;
s = b;
a = s;
b };
GENERATOR_CODE_END};
};
void timeFibonnaci() {
;
Fibonnaci afor(int j=0; j<22; j++) {
try {
.next();
a} catch(StopIteration null){
std::cout << " Exception Caught" << "...\n";
break;
}
};
}
int main(int, char **) {
for(long k=0; k<1000; k++) {
for(long j=0; j<1000; j++) {
for(long i=0; i<1000; i++) {
();
timeFibonnaci}
}
}
return 0;
}
The C++20 harness:
#include "cpp20generators.hpp"
<int> fib(int max) {
simple_generatorint a{1}, b{1}, n{0};
for(int i=0; i< max; i++) {
co_yield a;
= a + b;
n = b;
a = n;
b }
}
void timeFibonnaci() {
auto a = fib(22);
for(int j=0; j<22; j++) {
if (a.running()) {
.try_next();
aif (a.running()) {
.take();
a}
}
}
}
int main(int, char **) {
for(long k=0; k<1000; k++) {
for(long j=0; j<1000; j++) {
for(long i=0; i<1000; i++) {
();
timeFibonnaci}
}
}
return 0;
}
Finally, python harness:
def fib():
= 0, 1, 1
n, a, b while n<91: # Match C++, not really needed
yield a
= b, a+b
a, b +=1
n
def timeFibonnaci():
= fib()
f for i in range(22):
__next__()
f.
for k in range(1000):
for j in range(1000):
for i in range(1000):
timeFibonnaci()
Benchmark Results
This is the output for each of the 3 test harness using the linux
time
command.
C++03 | C++20 | Python | |
---|---|---|---|
real | 1m47.686s | 16m21.805s | 54m27.302s |
user | 1m47.683s | 16m21.680s | 54m26.746s |
sys | 0m0.000s | 0m0.048s | 0m0.080s |
Note:
- Assume python is our baseline
- C++20 version is 3.3 times faster than python version
- C++03 version is 30.5 times faster than python version
- C++03 version is 9.17 times faster than C++20 version
This really surprised me. I expected the C++03 approach to be quicker because it does less, but not that much faster! Secondly, while I expected the C++20 code to be faster than python, I didn’t expect it to be just 3.3 times faster.
That said, I don’t think this is enough of a bad strike against the C++20 version to discount using C++20 coroutines.
However, it does give pause.
Where’s the code?
Update: 11/4/2023
Want to read and runcode, not the blog? I know the feeling. I recently created a github repo specifically for putting code from blog posts in. Though recognising that it might also have non-code stuff added at some point, I’ve called it blog-extras.
The blog-extras repo can be found here:
The code specific to this series of short posts can be found here:
That repo will get fleshed out with code/examples from other posts in this blog too.
Discussion, some Conclusions
What matters most in implementations? There can be a lot of possible answers to this. These are a few that spring to might.
- Idiomatic for the language
- Clarity
- Debuggable
- Correctness
- Speed
Out of those points, given coroutines are a mechanism for implementing concurrency, I think clarity and correctness are the most important. On each point:
Idiomatic for the language - the C++20 version is clearly idiomatic for C++. While the C++03 version does build on C++ features that have longevity, it is definitely not idiomatic code. C++20
Clarity - I think this is perhaps measurable by the ideas “what does someone who has only passing familiarity with C++ features think this does?”. On the one hand the actual body of the coroutines in both C++03 and C++20 is similar level of “surprise” for the average developer, but given the prevalence of generators/coroutines in the python world due to async, I think the C++20 version has the edge here.
The fly in the ointment here is that the actual behaviour depends on the specific implementation driving the handling of
co_yield
etc, which means the C++03 code is perhaps clearer there - simply because it’s clear it’s doing something a bit odd.I think this is a tie between C++20 and C++03
Debuggable - unclear. The fact the C++03 version is a lot shorter and is a think wrapper means it might be easier to determine what’s gone wrong if things do. Slight win (maybe) for C++03.
Correctness - Similar points to debuggable. However, the bulk of ensuring code is correct means “less to get wrong”. I think that with the C++20 version the average user is less likely to make mistakes with the C++20 version, so I suspect the C++20 version is better in that regard. Slight win for C++20?
Speed - This is the point of this post, so let’s cover this in a bit of detail.
On the upside, C++20 coroutines are still faster than python - just not as fast as we expected or would like really! We’re not measuring raw language speed, but rather the overhead of a certain kind of context switch. This simple benchmark isn’t great, however I’d really expected the C++20 code to be similar in speed to the C++03 code. I definitely expected it to be faster than it is. Maybe it can be, but nonetheless this was very surprising.
The C++03 code essentially uses an object version of Duff’s device to build it’s coroutines, rather than the C99 style. There’s definitely some advantages to clarity - since essentially you’re building a state machine and using pre-processor Macros to hold it together. This leaks through to the code created by someone using these snippets. This could be prone to error.
By contrast the C++20 code has a fair amount of heavyweight shenanigans going on under the hood - but the result in terms of code clarity can’t really be denied. If I was a maintainer, I might prefer the C++20 version to debug and understand over the C++03 version. It’s problematic for performance - which is an issue in a language primarily used for performance. (Not exclusively - embedded systems spring to mind)
If we were ever in the situation that required raw speed in the context switch, if we could (perhaps) provide a common API to both the C++03 and C++20 coroutines, then we have a mechanism for performing optimisations should they become necessary.
What this doesn’t mean
This doesn’t mean that C++20 is slow. It means that the infrastructure around co-routines in C++20 is more complex. C++20 defers more decisions to the user of the functionality. This benchmark is (I think, I hope) testing that infrastructure difference cost - and reflects the increased complexity.
It’s likely that real world performance between C++20 and C++03 coroutines with realistic workloads will be similar. After all, the bulk of someone’s code is likely to be application related activities rather than inside that infrastructure. That’s a guess though at this stage. This is fairly simple to check though: by building similar systems with both and then benchmarking at a systems level.
OK, that’s twice as much work as I intended but it should be useful and the results are likely to be interesting.
Moving Forward
Going forward when C++-ifying the mini-guild tutorial, I’ll probably use both languages when implementing the steps involved.
Furthermore, when taking this forward, it does suggest though is that going forward and implementing my guild API (not just miniguild) in C++ would be best done in C++03 and C++20 in parallel - simply because of this huge performance difference. If the difference was more like “within 10-30% between the two versions” that’d enable focus on just one implementation, but this is a bigger difference than expected.
I’ll probably come to this after (or maybe alongside) the other posts in this series about ModernC++ as and when I write them.
Anyway, that’s this mini-series over. I learnt a fair amount about C++20 writing the code for it and then explaining it - I hope it was of interest. Let me know?
Updated: 2023/09/15 21:12:35