C++ virtual functions are slow compared to regular functions. This is what happens when you call a regular function:
– put arguments in registers and on the stack
– jump to a constant location
And this is what happens when you call a virtual function:
– follow the object’s vtable pointer to the virtual function table
– look up the needed function address
– put arguments in registers and on the stack
– jump to the loaded location
Agner Fog’s indispensable Optimizing Software in C++ says:
The time it takes to call a virtual member function is a few clock cycles more than it takes to
call a non-virtual member function, provided that the function call statement always calls the
same version of the virtual function. If the version changes then you may get a misprediction
penalty of 10 – 20 clock cycles.
The extra work is not 20 instructions' worth; it is the extra memory accesses and the unpredictable jump target that stall the processor. However, those extra cycles fail to capture another difference: it is impossible to inline a virtual function call. If the concrete function is inline-able, its advantage over the virtual call is much greater.
For example: suppose we need to process an ordered list of numbers. In the simplest case, the list is an array. In the more complicated case, the list might be a “filter” that skips over some elements of another array, or perhaps an arbitrary permutation of the elements in another array. The different cases are user-controlled so we must delay the choice until runtime. It’s a classic use case for virtual functions.
The filter and permutation might not feel the virtual penalty very much. The filter does a lot of work to find the next valid element in the array, and the permutation burns dozens or hundreds of cycles on cache misses. In either case, the virtual penalty is probably insignificant by comparison.
However, the simple case is nothing but a pointer increment and access. Automatic cache prefetches will keep its memory access quick. It’s under 5 cycles’ worth of work. If it were called non-virtually, the compiler would inline it. So now we’re not comparing a virtual call against a regular call — we’re comparing a virtual call against *no call*.
Inlined functions don’t need to push the caller-save registers on the stack or make any jumps whatsoever. They also open up opportunities for compiler optimizations like loop unrolling, strength reduction, constant folding, or dead code elimination. The difference between a virtual call and an inline function could be huge.
I wrote a benchmark to find out. We will construct virtual and non-virtual versions of the struct below and repeatedly call their double get(int i) methods in loops. We will also test two methods whose extra work could be aggressively optimized away by a compiler.
struct ArrayAccessor
{
    ArrayAccessor(double const *a) : a(a) {}
    double get(int i) const { return a[i]; }
    double getDead(int i, SpaceWaster sw) const { return a[i]; }
    double getSR(int i, int div) const { return a[i / div]; }
    double const *a;
};
The first potential optimization is dead code elimination. getDead takes an extra stack parameter that it does not use. Fully optimized code should never bother copying the struct to the stack. The second is strength reduction. getSR takes an extra integer parameter and divides by it, but in the benchmark the parameter value will be 2, so the division could be optimized into a right shift.
When writing benchmarks, we often must add extra silly-looking code to make sure the optimizer can’t transform our program into something completely different. Compilers can optimize a virtual call into a concrete call if they can prove the exact method being called at compile time. This benchmark required some extra work to make sure the program actually performs a virtual call. It contains two classes with identical methods as shown above. ArrayAccessorNoInherit has no virtual methods and is not part of an inheritance hierarchy. ArrayAccessor inherits from a dummy base class with the same method signatures. When the virtual benchmark runs, it chooses between ArrayAccessor and the dummy by branching on rand() == 0. So in practice we will (nearly) always pick the “real” version, but the compiler can’t prove that, so it must generate code with the full virtual call mechanism in effect. And of course, we must compute something from the return values and print it to the console, otherwise the entire program might get optimized away *.
The full benchmark code (feel free to skip to the results at the end):
#include <algorithm>
#include <chrono>
#include <cstdlib>   // for rand() and exit()
#include <iostream>
#include <iomanip>
#include <memory>
#include <random>

struct SpaceWaster { int i[8]; };

struct NumList
{
    virtual double get(int i) const { return 100; }
    virtual double getDead(int i, SpaceWaster sw) const { return 100; }
    virtual double getSR(int i, int div) const { return 100; }
};

struct ArrayAccessor : public NumList
{
    ArrayAccessor(double const *a) : a(a) {}
    double get(int i) const override { return a[i]; }
    double getDead(int i, SpaceWaster sw) const override { return a[i]; }
    double getSR(int i, int div) const override { return a[i / div]; }
    double const *a;
};

struct ArrayAccessorNoInherit
{
    ArrayAccessorNoInherit(double const *a) : a(a) {}
    double get(int i) const { return a[i]; }
    double getDead(int i, SpaceWaster sw) const { return a[i]; }
    double getSR(int i, int div) const { return a[i / div]; }
    double const *a;
};
int const N_TESTS = 10000000;
int const ARR_SIZE = 100;
int const OPS = N_TESTS * ARR_SIZE;

void prettyPrint(double seconds, double total)
{
    using namespace std;
    double CPU_HZ = 3.4 * 1000 * 1000 * 1000;   // clock speed of my test machine
    double cycPerOp = (CPU_HZ * seconds) / OPS;
    cout << fixed << right << setprecision(2) << setw(6)
         << seconds << " Seconds, "
         << fixed << right << setprecision(2) << setw(6)
         << cycPerOp << " Cycles/Op"
         << " (result = " << defaultfloat << total << ")\n";
}
template <typename T>
void test(T *nl, double const *nums, int size)
{
    using namespace std::chrono;
    typedef high_resolution_clock clock;
    auto start = clock::now();
    auto seconds = [&]()
        { return duration<double>(clock::now() - start).count(); };

    start = clock::now();
    double total = 0;
    for (int test = 0; test < N_TESTS; ++test) {
        for (int i = 0; i < size; ++i) {
            total += nl->get(i);
        }
    }
    std::cout << "            Plain: ";
    prettyPrint(seconds(), total);

    start = clock::now();
    total = 0;
    SpaceWaster sw {};
    for (int test = 0; test < N_TESTS; ++test) {
        for (int i = 0; i < size; ++i) {
            total += nl->getDead(i, sw);
        }
    }
    std::cout << "         DeadCode: ";
    prettyPrint(seconds(), total);

    start = clock::now();
    total = 0;
    for (int test = 0; test < N_TESTS; ++test) {
        for (int i = 0; i < size; ++i) {
            total += nl->getSR(i, 2);
        }
    }
    std::cout << "StrengthReduction: ";
    prettyPrint(seconds(), total);
}
int main()
{
    double nums[ARR_SIZE];
    std::default_random_engine engine(42);
    std::uniform_real_distribution<double> distrib;
    std::generate_n(nums, ARR_SIZE, [&](){ return distrib(engine); });

    std::unique_ptr<NumList> virt;
    if (rand() == 0) {
        virt.reset(new NumList());
        std::cout << "An improbable event occurred, please try again.\n";
        exit(1);
    }
    else {
        virt.reset(new ArrayAccessor(nums));
    }
    std::cout << "Virtual:\n";
    test(virt.get(), nums, ARR_SIZE);

    std::unique_ptr<ArrayAccessorNoInherit> conc(new ArrayAccessorNoInherit(nums));
    std::cout << "\nConcrete:\n";
    test(conc.get(), nums, ARR_SIZE);
}
I compiled with MSVC 2013 in 64-bit Release mode, optimizing for speed. My results:
Virtual:
            Plain:   1.46 Seconds,   4.98 Cycles/Op (result = 4.5e+008)
         DeadCode:   1.72 Seconds,   5.86 Cycles/Op (result = 4.5e+008)
StrengthReduction:   3.39 Seconds,  11.52 Cycles/Op (result = 4.3e+008)

Concrete:
            Plain:   0.84 Seconds,   2.85 Cycles/Op (result = 4.5e+008)
         DeadCode:   0.84 Seconds,   2.85 Cycles/Op (result = 4.5e+008)
StrengthReduction:   0.84 Seconds,   2.86 Cycles/Op (result = 4.3e+008)
This benchmark clearly demonstrates the issues discussed in the first part of this post. In the plain array-access case, we see nearly 2x overhead from making the virtual call. The penalty is about 2 cycles per operation, consistent with Agner’s measurements. But it naturally gets worse in the optimizable cases. The optimizer successfully performs dead code elimination and strength reduction on the concrete function call, but is unable to do so on the virtual call. The strength reduction from division to shift shows the most dramatic difference. Examining the assembly output also showed that the concrete version is unrolled into a 10-iteration loop.
A penalty of 2 cycles is almost nothing. Certainly on a reasonable time scale like once per frame in a game or once per mouse movement, we should not care about it. But if the function gets called millions of times in a hot loop, the difference matters. Furthermore, if there is any possibility for inlining and subsequent compiler optimizations, the penalty could get bigger. We should try to avoid virtual calls in performance-critical areas.
In the next post, I will share some thoughts about maintaining runtime polymorphism while improving performance on this kind of workload. I will also try to reproduce Agner’s numbers for unpredictable virtual calls, and benchmark some virtual functions that do more work to see if the overhead is still measurable.
* Yes, this actually happens to the concrete version if we don’t print the result to the console. The whole program gets optimized to nothing and the reported time for the Concrete benchmark is 0.