Virtual functions: still slow

C++ virtual functions are slow compared to regular functions. This is what happens when you call a regular function:

– put arguments in registers and on the stack
– jump to a constant location

And this is what happens when you call a virtual function:

– follow the object’s vtable pointer to the virtual function table
– look up the needed function address
– put arguments in registers and on the stack
– jump to the loaded location

Agner Fog’s indispensable Optimizing Software in C++ says:

The time it takes to call a virtual member function is a few clock cycles more than it takes to
call a non-virtual member function, provided that the function call statement always calls the
same version of the virtual function. If the version changes then you may get a misprediction
penalty of 10 – 20 clock cycles.

The extra work is not 20 instructions' worth; the cost comes from the extra memory accesses and the unpredictable jump target, which stall the processor. However, those extra cycles fail to capture another difference: a virtual function call is impossible to inline. If the concrete function is inline-able, its advantage over the virtual call is much greater.

For example: suppose we need to process an ordered list of numbers. In the simplest case, the list is an array. In the more complicated case, the list might be a “filter” that skips over some elements of another array, or perhaps an arbitrary permutation of the elements in another array. The different cases are user-controlled so we must delay the choice until runtime. It’s a classic use case for virtual functions.

The filter and permutation might not feel the virtual penalty very much. The filter does a lot of work to find the next valid element in the array. The permutation burns dozens or hundreds of cycles on cache misses. In either case, the virtual penalty is probably insignificant by comparison.

However, the simple case is nothing but a pointer increment and access. Automatic cache prefetches will keep its memory access quick. It’s under 5 cycles’ worth of work. If it were called non-virtually, the compiler would inline it. So now we’re not comparing a virtual call against a regular call — we’re comparing a virtual call against *no call*.

Inlined functions don’t need to push the caller-save registers on the stack or make any jumps whatsoever. They also open up opportunities for compiler optimizations like loop unrolling, strength reduction, constant folding, or dead code elimination. The difference between a virtual call and an inline function could be huge.


I wrote a benchmark to find out. We will construct virtual and non-virtual versions of the struct below and repeatedly call their double get(int i) methods in loops. We will also test two methods containing extra work that a compiler could optimize away.

struct ArrayAccessor
{
	ArrayAccessor(double const *a) : a(a) {}
	double get(int i) const { return a[i]; }
	double getDead(int i, SpaceWaster sw) const { return a[i]; }
	double getSR(int i, int div) const { return a[i / div]; }
	double const *a;
};

The first potential optimization is dead code elimination. getDead takes an extra stack parameter that it does not use. Fully optimized code should never bother copying the struct to the stack. The second is strength reduction. getSR takes an extra integer parameter and divides by it, but in the benchmark the parameter value will be 2, so the division could be optimized into a right shift.

When writing benchmarks, we often must add extra silly-looking code to make sure the optimizer can’t transform our program into something completely different. Compilers can turn a virtual call into a concrete call if they can prove at compile time exactly which method will be called, so this benchmark needed some extra work to guarantee it actually performs virtual calls.

It contains two accessor classes with the identical methods shown above. ArrayAccessorNoInherit has no virtual methods and is not part of an inheritance hierarchy. ArrayAccessor inherits from a dummy base class, NumList, with the same method signatures. When the virtual benchmark runs, it chooses between ArrayAccessor and the dummy base by branching on rand() == 0. In practice we will (nearly) always pick the “real” version, but the compiler can’t prove that, so it must generate the full virtual call mechanism. And of course, we must compute something from the return values and print it to the console, otherwise the entire program might get optimized away *.

The full benchmark code (feel free to skip to results at end):

#include <algorithm>
#include <chrono>
#include <cstdlib>
#include <iomanip>
#include <iostream>
#include <memory>
#include <random>

struct SpaceWaster { int i[8]; };

struct NumList
{
	virtual double get(int i) const { return 100; }
	virtual double getDead(int i, SpaceWaster sw) const { return 100; }
	virtual double getSR(int i, int div) const { return 100; }
};

struct ArrayAccessor : public NumList
{
	ArrayAccessor(double const *a) : a(a) {}
	double get(int i) const override { return a[i]; }
	double getDead(int i, SpaceWaster sw) const override { return a[i]; }
	double getSR(int i, int div) const override { return a[i / div]; }
	double const *a;
};

struct ArrayAccessorNoInherit
{
	ArrayAccessorNoInherit(double const *a) : a(a) {}
	double get(int i) const { return a[i]; }
	double getDead(int i, SpaceWaster sw) const { return a[i]; }
	double getSR(int i, int div) const { return a[i / div]; }
	double const *a;
};

int const N_TESTS = 10000000;
int const ARR_SIZE = 100;
int const OPS = N_TESTS * ARR_SIZE;

void prettyPrint(double seconds, double total)
{
	using namespace std;
	double CPU_HZ = 3.4 * 1000 * 1000 * 1000;
	double cycPerOp = (CPU_HZ * seconds) / OPS;
	cout << fixed << right << setprecision(2) << setw(6)
	     << seconds  << " Seconds, " 
	     << fixed << right << setprecision(2) << setw(6)
	     << cycPerOp << " Cycles/Op"
	     << "  (result = " << defaultfloat << total << ")\n";
}

template <typename T>
void test(T *nl, double const *nums, int size)
{
	using namespace std::chrono;
	typedef high_resolution_clock clock;
	auto start = clock::now();
	auto seconds = [&]()
		{ return duration<double>(clock::now() - start).count(); };

	start = clock::now();
	double total = 0;
	for (int test = 0; test < N_TESTS; ++test) {
		for (int i = 0; i < size; ++i) {
			total += nl->get(i);
		}
	}
	std::cout << "            Plain: ";
	prettyPrint(seconds(), total);

	start = clock::now();
	total = 0;
	SpaceWaster sw {};
	for (int test = 0; test < N_TESTS; ++test) {
		for (int i = 0; i < size; ++i) {
			total += nl->getDead(i, sw);
		}
	}
	std::cout << "         DeadCode: ";
	prettyPrint(seconds(), total);

	start = clock::now();
	total = 0;
	for (int test = 0; test < N_TESTS; ++test) {
		for (int i = 0; i < size; ++i) {
			total += nl->getSR(i, 2);
		}
	}
	std::cout << "StrengthReduction: ";
	prettyPrint(seconds(), total);
}

int main()
{
	double nums[ARR_SIZE];
	std::default_random_engine engine(42);
	std::uniform_real_distribution<double> distrib;
	std::generate_n(nums, ARR_SIZE, [&](){ return distrib(engine); });

	std::unique_ptr<NumList> virt;
	if (rand() == 0) {
		virt.reset(new NumList());
		std::cout << "An improbable event occurred, please try again.\n";
		exit(1);
	}
	else {
		virt.reset(new ArrayAccessor(nums));
	}
	std::cout << "Virtual:\n";
	test(virt.get(), nums, ARR_SIZE);

	std::unique_ptr<ArrayAccessorNoInherit> conc(new ArrayAccessorNoInherit(nums));
	std::cout << "\nConcrete:\n";
	test(conc.get(), nums, ARR_SIZE);
}

I compiled with MSVC 2013 in 64-bit Release mode, optimizing for speed. My results:

Virtual:
            Plain:   1.46 Seconds,   4.98 Cycles/Op  (result = 4.5e+008)
         DeadCode:   1.72 Seconds,   5.86 Cycles/Op  (result = 4.5e+008)
StrengthReduction:   3.39 Seconds,  11.52 Cycles/Op  (result = 4.3e+008)

Concrete:
            Plain:   0.84 Seconds,   2.85 Cycles/Op  (result = 4.5e+008)
         DeadCode:   0.84 Seconds,   2.85 Cycles/Op  (result = 4.5e+008)
StrengthReduction:   0.84 Seconds,   2.86 Cycles/Op  (result = 4.3e+008)

This benchmark clearly demonstrates the issues discussed in the first part of this post. In the plain array-access case, we see nearly 2x overhead from making the virtual call. The penalty is about 2 cycles per operation, consistent with Agner’s measurements. But it naturally gets worse in the optimizable cases. The optimizer successfully performs dead code elimination and strength reduction on the concrete function call, but is unable to do so on the virtual call. The strength reduction from division to shift shows the most dramatic difference. Examining the assembly output also showed that the concrete version is unrolled into a 10-iteration loop.

A penalty of 2 cycles is almost nothing. Certainly on a reasonable time scale like once per frame in a game or once per mouse movement, we should not care about it. But if the function gets called millions of times in a hot loop, the difference matters. Furthermore, if there is any possibility for inlining and subsequent compiler optimizations, the penalty could get bigger. We should try to avoid virtual calls in performance-critical areas.

In the next post, I will share some thoughts about maintaining runtime polymorphism while improving performance on this kind of workload. I will also try to reproduce Agner’s numbers for unpredictable virtual calls, and benchmark some virtual functions that do more work to see if the overhead is still measurable.

* Yes, this actually happens to the concrete version if we don’t print the result to the console. The whole program gets optimized to nothing and the reported time for the Concrete benchmark is 0.

C++ Wish List: Constify

Sometimes you want to mutate a value for a little while, and then freeze it. Reasons include:

– Interfacing with an API that likes to mutate its arguments
– Doing an in-place algorithm such as sorting
– It’s just easier to express the algorithm that way

But this kind of error occurs:

	void fill_array(float *values, int length);
	void sort_array(float *values, int length);

	float *to_be_mutated = new float[42];
	float *to_be_sorted = new float[42];

	fill_array(to_be_mutated, 42);
	sort_array(to_be_mutated, 42); // oh no! should have passed to_be_sorted!

This kind of error is pretty common in my experience, especially when variables with compatible types have similar names.

Wouldn’t it be cool if you could tell the C++ compiler to convert a local variable into a const variable for the remainder of the function?

	void fill_array(float *values, int length);
	void sort_array(float *values, int length);

	float *to_be_mutated = new float[42];
	float *to_be_sorted = new float[42];

	fill_array(to_be_mutated, 42);
	constify to_be_mutated;

	sort_array(to_be_mutated, 42); // compile error!

I don’t know much about compilers, but this seems like it would be easy to implement.

C++ Wish List: Multiple return values

I really like Python’s multiple return values, aka automatic tuple unpacking:

t, p = project_point(curve, point)
# t is the parameter value of the projected point, p is the actual point

In C++ we must do something ugly with references like this:

Point p;
double t = project_point(curve, point, p);

Or use tuples everywhere:

std::tuple<double, Point> tp = project_point(curve, point);

If C++ copied Python’s syntax, we might see something like this:

Point p, double t = project_point(curve, point);

But that could get ugly fast with longer type names. Instead, what about the following syntax:

empty Point p; // Point() constructor isn't called
empty double t;
t, p = project_point(curve, point);

I like it. We’d have to overload some reserved word instead of using empty, but I’m sure the committee could find a way to make it work.

I Hate Singletons

I came across an egregiously bad use of singletons in my company’s code base today.

There is a global Progress singleton for long-running operations to display their progress in the UI.

Every big algorithm updates the Progress singleton, even if its running time is close to instantaneous on small input data.

The “Progress update handler” listens to the Progress singleton. When the progress value changes, it updates a progress bar in the GUI.

All GUI toolkits I know of are single-threaded. In our case, WinForms throws an exception when you try to mess around with the GUI from different threads.

This means none of our big algorithms can run on a different thread from the GUI.

SMH

Of course, this could all be fixed if the algorithms took an optional pointer to a Progress instance, but somebody thought the Singleton was nicer than passing around one extra pointer.

Since that would be a lot of work, the best band-aid is checking the WinForms control’s InvokeRequired property in the Progress update handler, and simply doing nothing if a non-GUI thread is trying to update it.

Dependency Injection in C++ is hard

When I first watched Miško Hevery’s Clean Code talks, I had been working for almost a year on my first large-scale programming project in the real world. I had encountered some difficulties with unit testing along the way. Some of my tests depended on nondeterministic things like interprocess communication or timers, like this camera animator class:

class CameraAngleAnimator
{
public:
    CameraAngleAnimator(double topSpeed);
    void setGoal(Quaternion const &angle);
    Quaternion currentAngle();

private:
    Clock clock_;
    Quaternion goal_;
    Quaternion lastAngle_;
    Time lastTime_;
    double topSpeed_;
};

In this class, the camera moves towards the goal up to a given top speed. Each time we’re ready to draw graphics, we query the currentAngle(). The class looks at how much time has elapsed since the last update, and moves the camera accordingly.

My unit tests looked like this:

CameraAngleAnimator animator(M_PI); // 180 degrees per second
Quaternion start(Vector(1, 0, 0), 0);
Quaternion end(Vector(1, 0, 0), M_PI);
animator.setGoal(start); // first call - snaps to position immediately
animator.setGoal(end);
Sleep(100); // uh-oh!
Quaternion partway = animator.currentAngle();
assert(partway != start);
assert(partway != end);

This kind of unit test worked most of the time, but of course it would sometimes fail mysteriously because Sleep() is not guaranteed to be accurate, and who knows what other processes might be using up the CPU?

Furthermore, I had to keep the Sleep() durations well above the scheduler’s time slice. I think I needed about 15-millisecond intervals to get stability. A few of those tests add up to make the test suite slow.

The fix is to follow Miško’s advice, which boils down to:

Ask for your dependencies instead of constructing them yourself.

class IClock
{
public:
    virtual ~IClock() {}
    virtual Time now() = 0;
};

class CameraAngleAnimator
{
public:
    CameraAngleAnimator(double topSpeed, IClock &clock);
...
};

This allows us to make an implementation of IClock for unit testing that supplies deterministic results.


image from lisbokt

A bigger class might have 3 or 4 dependencies. For users, it’s annoying to construct those dependencies and pass them in. The code gets cluttered. Why should I deal with all this crap for the sake of their unit tests?

Programmers in Java (or C#, Python, Ruby, …) would typically use a factory to hide the dependencies from normal usage:

class CameraAngleAnimatorFactory
{
    public CameraAngleAnimator create(double topSpeed) {
        return new CameraAngleAnimator(topSpeed, new Clock());
    }
};

But wait a second! Before the change, the CameraAngleAnimator owned its Clock. It was a concrete, non-pointer data member. Everything was simple. No custom destructor or copy/assignment behavior was needed. Who owns the IClock now?

In garbage-collected languages, this is never an issue because every object is stored on the heap and garbage-collected when it’s not needed. C++ is not as easy.

  • We can’t use a concrete data member because we have no idea what class we’re storing.
  • A reference makes a default dependency or factory impossible: whoever owns the CameraAngleAnimator has to own the IClock as well.
  • An owned raw pointer begs for classic C/C++ heap errors: do you call delete in the destructor? If so, you’ll probably delete something twice by mistake, or try to delete a stack variable. If not, you’ll probably leak something. A unique_ptr is better, but still leaves the possibility of double-ownership problems.

In this simple example, we could just add a parameter double time to the currentAngle() call. Indeed, that is probably the best way to do it. But let’s imagine that the dependency is actually more complicated and not well modeled by adding method parameters.

One option is the following:

class CameraAngleAnimator
{
public:
    CameraAngleAnimator(double topSpeed) : clock_(&ownedClock_) {}
    void overrideClock(IClock &clock) { clock_ = &clock; }
    ...

private:
    Clock ownedClock_;
    IClock *clock_;
    ...
};

This version bundles a concrete Clock with each instance but lets you override it with a reference to an external IClock. If unit testing is the only reason we want dependency injection, then this is probably the way to go. However, if we need to override in the “real world”, it will not solve any ownership problems. It’s also not compatible with the Factory pattern because there’s no way for the object to take ownership of the override dependency.

Another option that’s closer to what you get in Java is a shared_ptr.

class CameraAngleAnimator
{
public:
    CameraAngleAnimator(double topSpeed, shared_ptr<IClock> clock);
    ...

private:
    shared_ptr<IClock> clock_;
    ...
};

The default dependency could come from a factory, or the .cpp implementation file could construct a default Clock if the parameter is null.

Compared to the reference/concrete version, this doesn’t waste any memory on an inactive data member when the clock is overridden. It frees the user of concern about ownership and lifetimes. But it forces a level of indirection to the heap, and it “infects” the rest of the code base with shared_ptr.

A third option is merging the two approaches:

class CameraAngleAnimator
{
public:
    CameraAngleAnimator(double topSpeed) : clock_(&ownedClock_) {}
    void overrideClock(shared_ptr<IClock> clock) {
        externalClock_ = clock;
        clock_ = clock.get();
    }
    ...

private:
    Clock ownedClock_;
    shared_ptr<IClock> externalClock_;
    IClock *clock_;
    ...
};

This version:

  • Carries extra memory around when overridden, but
  • Gets the ease-of-use of automatic memory management,
  • Works with the factory pattern, and
  • Doesn’t incur the penalty of heap allocation if the default Clock is used

It’s depressingly complex compared to the Java version though.

A completely different approach is templates, but they are not suitable for my workplace because they don’t play well with our Python and C# wrappers.

I’m undecided. One thing’s for certain: dependency injection in C++ is much less straightforward than in a garbage-collected language.

Edit, December 17, 2013: Fixed a misuse of references vs. pointers.