Using Abstraction to Optimize Runtime Polymorphism

What is the cost of polymorphism?

As any experienced C++ programmer knows, virtual functions are slower than normal functions because they require extra indirection. The object's v-table pointer must be dereferenced to reach the v-table itself, and then a table lookup has to be performed to find and call the actual function. In most situations this overhead is acceptable and very much outweighed by the benefits of using virtual functions. Most modern compilers are also pretty good at optimizing your code; they can often determine the instances where a virtual call can only resolve to one function and treat it as a normal function call.

However, there is another source of performance loss that should be of much greater concern to game programmers who need the most performance from their application. Virtual functions increase the probability of data cache misses, especially on platforms with only a small amount of cache memory. These misses can be a significant source of performance loss on platforms such as consoles or mobile devices, which have slow main memory. Accessing main memory can be several orders of magnitude slower than accessing the data cache.

In order to understand how virtual functions affect data cache performance, we must look at how a typical data cache works. Data caches are divided into cache lines that are all of the same size, say 128 bytes. When a data item is loaded into the cache from main memory, the cache system loads enough bytes to fill one whole cache line. So if the data item is 32 bytes, the cache system will load those 32 bytes along with the next 96 bytes in order to fill one 128-byte cache line. If those extra bytes are never needed, cache space is wasted and data cache misses become more likely. The cache also chooses which line to load the data into by using a subset of bits from the data item's memory address, so if the address is not aligned on the proper boundary, the data item may be placed far from other data already in the cache. Finally, when a cache needs to make room for new data, it often employs a Least-Recently-Used replacement policy, in which the data that has gone unused for the longest time is evicted first.
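To make this concrete, here is a minimal sketch (the Particle struct is hypothetical, and the 128-byte line size matches the example above) showing how object size determines how many objects share one cache line:

struct Particle            // hypothetical 32-byte game object
{
    float position[3];
    float velocity[3];
    float mass;
    float age;
};

// sizeof(Particle) == 32, so four Particles packed contiguously fill one
// 128-byte cache line; touching one of them pulls its three neighbours into
// the cache as well, which is wasted bandwidth if they are never used.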

So, to make our code more cache-friendly, it needs to exhibit spatial locality of reference and temporal locality of reference. In other words, the data items referenced by the code should be adjacent to each other in contiguous memory, aligned on the proper address boundary, and the data that stays in the cache should be the data that is used most frequently. These principles are harder to achieve when using virtual functions because every virtual function call requires the v-table pointer as well as the v-table itself to be loaded into the data cache. The v-table pointer is usually the first 4 (or 8) bytes of an object, so after it is loaded into the cache, the next adjacent bytes of the object are loaded with it. If we want cache usage to be optimal, we should ensure that the most frequently used member variables in the object's class are the first ones declared, so their memory is adjacent to the v-table pointer's memory. In practice, different member functions touch different members, so an ordering that is good for one is poor for another, and cache usage often ends up less than optimal. Also, since we have no control over where the v-tables themselves are located in memory, they will often lack spatial locality. If we need to load v-tables for many different object types in a random order, we also lose temporal locality, and data cache misses become more likely. On top of all this, these data cache misses are hard to spot in a profiler.
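As a hedged sketch of the member-ordering advice above (the class and its members are hypothetical), the idea is simply that the per-frame data is declared first so it sits next to the v-table pointer at the start of the object:

class Enemy
{
public:
    virtual void updateAI( float dt );

private:
    // Hot data, touched every update: declared first so it shares cache
    // lines with the v-table pointer at the start of the object.
    float m_position[3];
    float m_velocity[3];
    float m_health;

    // Cold data, rarely touched: pushed to the end of the object.
    char  m_debugName[64];
    int   m_spawnCount;
};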

So why even use polymorphism in our code? Why not just avoid it entirely so virtual functions are not necessary? Well, that decision is entirely up to the programmer, but it should not be made lightly since the benefits provided by polymorphism are significant and cannot be ignored.

Let’s look at an example of how runtime polymorphism is often used in games to demonstrate the pros and cons of polymorphism.

Updating objects polymorphically

When developing a game engine, it is often necessary to iterate over a large number of related objects and call one of their common methods. When this is done inside the engine's main loop, it must be as efficient as possible. If every object were of the same type, we could simply write one loop that calls the method on every object in the system. But when the objects are of many different types, we need multiple loops. For example, if the system contains three different object subtypes, such as Fighter, Bomber, and Recon, then we need three loops, one for each subtype.

Fig. 1

while( fighter )
{
    fighter->updateAI( dt );
    fighter = fighter->next;
}

while( bomber )
{
    bomber->updateAI( dt );
    bomber = bomber->next;
}

etc.
The one major drawback to this approach, however, is that it is not extensible. Extensibility is the ability to update the engine easily and reliably, for example by adding more game object subtypes to the existing ones. If we wanted to add a Seaplane subtype, we should be able to do so with no edits to the existing code base and only minimal additions, such as a new derived class. The system in Fig. 1 is not extensible, since adding more subtypes requires adding more loops to the code. If these loops are duplicated throughout the code base, adding new ones is tedious and error prone. In object-oriented programming parlance this property is known as rigidity, and it is not a good thing to have in a game engine, or in any software for that matter. Even if we use a single loop and check the object type with an if-else block or a switch statement, the code is still riddled with large chunks of duplication and is therefore still very rigid. (If you never plan on updating your engine, rigidity might not be an issue, however.)
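For illustration, here is a sketch of that single-loop, switch-on-type alternative (the type enum and the casts are hypothetical); every new subtype means editing this switch, and every other one like it, which is exactly the rigidity described:

while( obj )
{
    switch( obj->type )
    {
    case TYPE_FIGHTER:
        static_cast<Fighter*>( obj )->updateAI( dt );
        break;
    case TYPE_BOMBER:
        static_cast<Bomber*>( obj )->updateAI( dt );
        break;
    case TYPE_RECON:
        static_cast<Recon*>( obj )->updateAI( dt );
        break;
    }
    obj = obj->next;
}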

Of course, this is a classic example of where run-time polymorphism can be very beneficial. It is usually implemented with the well-known Template Method pattern (not to be confused with C++ templates): an abstract base interface class that all the subtype classes inherit from. So in our example, we would have:

Fig. 2

class GameObject
{
public:
    virtual void updateAI( float dt ) = 0;
};

class Fighter : public GameObject
{
public:
    void updateAI( float dt ) { /* ... */ }
};

class Bomber : public GameObject
{
public:
    void updateAI( float dt ) { /* ... */ }
};

etc.

while( obj )
{
    obj->updateAI( dt );
    obj = obj->next;
}
So now, to update every object in the system, all we need is a single pointer of type GameObject that we point at each instance in turn, whether it is a Fighter, a Bomber, or anything else, and through which we call its updateAI method. At run time, the dynamic type of the object, not the type of the pointer, determines which function is called. We now need only one loop, and it never changes even when we add more subtypes to the system.

Making Polymorphism faster

OK, now that we have seen the benefits of polymorphism, let's assume we have decided it is indispensable to the design of our game's code, but that the performance is still a concern. One way to improve the performance while retaining the benefits is to add a higher level of abstraction to the system. Consider the situation where we have one thousand objects that need to be updated, but only two of them are Bombers and the other nine hundred ninety-eight are Fighters. The virtual table lookup being done on every object in the loop becomes redundant, since nine hundred ninety-eight times in a row the object is of the same type. This redundancy is obviously not desirable and raises the question: is there a better way to implement the loop?

First, let us consider whether compile-time polymorphism is the answer. This technique was explained in great depth in the GameDev.net article Improving Performance in C++ with Compile Time Polymorphism by Ben Sunshine-Hill. The idea is to use C++ templates to move the class type of the objects out of the loop's definition, making the loop more generic.

Fig. 3

template<class T>
void updateAll( T *t, float dt )   // t: head of a list of objects of type T
{
    while( t )
    {
        t->updateAI( dt );
        t = t->next;
    }
}
So now the type of the object will be determined at compile time instead of at run time. But when we try to put this code into context within the engine, we see that we have not really achieved anything, since we still have to call the updateAll function for every subtype in the system:

Fig. 4

updateAll<Fighter>( fighter, dt );
updateAll<Bomber>( bomber, dt );
updateAll<Recon>( recon, dt );

This is akin to the situation we had in Fig. 1 at the start of the article: even though we no longer duplicate the loop code, we still have to duplicate the calls to it, and any kind of duplication is not conducive to clean, extensible, maintainable code.

Compile-time polymorphism is a good solution when all the objects being updated are of a single type known at compile time, but it does not solve the problem of updating a mixed collection of objects through a common interface.

So is there any way to increase performance in our system? Fortunately the answer is yes, there is a better way. As stated before, we can add more abstraction to our system in order to move the polymorphism to a higher layer and thus reduce the frequency of virtual function calls.

Let’s take another look at Fig. 1. Notice that we have one update loop for each object subtype. What if we placed each such loop inside a virtual method of a new class? Could we not then use run-time polymorphism on that class instead of on the object class? Yes: we can define a base class that acts as a container, or bucket, for an object subtype, have each derived bucket class hold a list of objects of its subtype, and update only those objects. At run time we use a base Bucket pointer to point to each of the derived bucket instances and call their updateAll functions. The buckets do not need virtual functions to update the objects in their lists, since those objects are always of the same type. So now we have:

Fig. 5

class Bucket
{
public:
    virtual void updateAll( float dt ) = 0;
};

class FighterBucket : public Bucket
{
public:
    virtual void updateAll( float dt )
    {
        Fighter *fighter = m_fighterList;   // head of this bucket's list
        while( fighter )
        {
            fighter->updateAI( dt );        // call to non-virtual function
            fighter = fighter->next;
        }
    }

private:
    Fighter *m_fighterList;
};

class BomberBucket : public Bucket
{
public:
    virtual void updateAll( float dt )
    {
        Bomber *bomber = m_bomberList;      // head of this bucket's list
        while( bomber )
        {
            bomber->updateAI( dt );         // call to non-virtual function
            bomber = bomber->next;
        }
    }

private:
    Bomber *m_bomberList;
};

etc.


So now we have the same Template Method pattern as in Fig. 2, but applied to the buckets instead of to the individual objects. This means that when we update our one thousand objects, most of which are the same type, we only make at most three virtual method calls instead of one thousand, and we still have all the extensibility benefits of run-time polymorphism. Splendid! We have achieved the best of both worlds: a system that is both efficient and extensible. This kind of batch processing also makes it easier to optimize the memory usage of each object type, because each container can keep its list of objects in contiguous memory, which leads to better cache coherency through spatial locality. Essentially, what we have done is add another level of abstraction, the buckets, in order to improve efficiency. In the source code for this article, the performance of the various methods outlined above is compared by recording their execution times.
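For completeness, here is a sketch of the top-level loop that Fig. 5 implies but does not show (the bucket instances and the array holding them are hypothetical); the only virtual calls per frame are one updateAll per bucket, regardless of how many objects each bucket holds:

Bucket *buckets[] = { &fighterBucket, &bomberBucket, &reconBucket };
const size_t numBuckets = sizeof( buckets ) / sizeof( buckets[0] );

for( size_t i = 0; i < numBuckets; ++i )
{
    buckets[i]->updateAll( dt );   // one virtual call per subtype
}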

When the program is profiled in the AMD CodeAnalyst tool, we can see that the runSimple function takes 2.73 timer samples, the runPoly function takes 6.36, and runBucket takes 0. The total samples needed to update each system are:

Simple: 27.57

Bucket: 28.18

Polymorphism: 40.9

As you can see, polymorphism is quite fast, but ordering our objects by their type increases the speed by a noticeable amount. Batch processing is also beneficial in that it can help to resolve the conflicts caused by inter-object dependencies in complex game object systems.
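As a hedged sketch of that contiguous-storage idea, and assuming the bucket is free to own its objects by value (this variant is not part of the article's sample code), a bucket could hold its Fighters in a std::vector instead of walking a linked list:

#include <cstddef>
#include <vector>

class ContiguousFighterBucket : public Bucket
{
public:
    virtual void updateAll( float dt )
    {
        // The Fighters are packed back to back, so this loop walks memory
        // linearly and makes good use of every cache line it loads.
        for( std::size_t i = 0; i < m_fighters.size(); ++i )
        {
            m_fighters[i].updateAI( dt );   // static type is known here, no base-pointer dispatch
        }
    }

private:
    std::vector<Fighter> m_fighters;   // objects stored by value, contiguously
};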

Choosing the correct level of abstraction in object-oriented designs is always important for performance. As another example, say we have a class with many small virtual functions that are only one or two lines long and are usually called from within another function. In this case, it is a good idea to move the point of abstraction to a higher level in the code by folding those small virtual calls into a single larger virtual function.

Fig. 6

class SpaceShip
{
public:
    virtual void MissileHit()
    {
        m_iEnergy -= 10;
    }

    virtual void OnMsg( Msg *msg )
    {
        switch( msg->type_ )
        {
        case HIT_BY_MISSILE:
            MissileHit();
            SendMsg( MSG_TARGET_DESTROYED, msg->parent );
            break;
        }
    }
};

class EnemySpaceShip : public SpaceShip
{
public:
    virtual void MissileHit()
    {
        // do some class-specific stuff
    }

    virtual void OnMsg( Msg *msg )
    {
        switch( msg->type_ )
        {
        case HIT_BY_CANNON:
            // case-specific code
            break;
        default:
            SpaceShip::OnMsg( msg );
            break;
        }
    }
};

Here the programmer has chosen to make both the MissileHit function and the OnMsg function virtual so they can be overridden in derived classes and customized. However, MissileHit is not a good candidate for a virtual function since it is only one or two lines and is a better candidate for being inlined. Also, since it is called from within OnMsg, we now have nested virtual functions, which constitute unnecessary overhead and probably poor design. Let’s see if we can reduce the number of virtual function calls:

Fig. 7

class SpaceShip
{
public:
    inline void MissileHit()
    {
        m_iEnergy -= 10;
    }

    virtual void OnMsg( Msg *msg )
    {
        // same as before
    }
};

class EnemySpaceShip : public SpaceShip
{
public:
    inline void MissileHit()
    {
        // do some class-specific stuff
    }

    virtual void OnMsg( Msg *msg )
    {
        switch( msg->type_ )
        {
        case HIT_BY_CANNON:
            // case-specific code
            break;
        case HIT_BY_MISSILE:
            EnemySpaceShip::MissileHit();
            SendMsg( MSG_TARGET_DESTROYED, msg->parent );
            break;
        default:
            SpaceShip::OnMsg( msg );   // forward any other messages to the base handler
            break;
        }
    }
};
We have now made MissileHit an inline non-virtual function and we have added a new case to the derived version of OnMsg to call the derived version of MissileHit so it does not need to be virtual. This does make the code a little more redundant, but the performance benefit is very desirable and the redundancy can be removed by continuing to refactor the code into a better design such as using function pointers.
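As one hedged sketch of that function-pointer direction (the handler table and its setup are hypothetical, not part of the code above), the switch can be replaced by a table of member-function pointers keyed by message type, so adding a message means adding a table entry rather than another case:

#include <map>

class TableDrivenShip
{
public:
    typedef void (TableDrivenShip::*MsgHandler)( Msg *msg );

    TableDrivenShip()
    {
        m_handlers[HIT_BY_MISSILE] = &TableDrivenShip::MissileHit;
    }

    void OnMsg( Msg *msg )   // single non-virtual entry point
    {
        std::map<int, MsgHandler>::iterator it = m_handlers.find( msg->type_ );
        if( it != m_handlers.end() )
            ( this->*( it->second ) )( msg );
    }

private:
    void MissileHit( Msg *msg )
    {
        m_iEnergy -= 10;
        SendMsg( MSG_TARGET_DESTROYED, msg->parent );
    }

    std::map<int, MsgHandler> m_handlers;
    int m_iEnergy;
};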

In many high-level languages, such as Java, methods are effectively always virtual, but in C++ we get to choose which methods are virtual. This choice can be used to our advantage when trying to achieve better performance.

Conclusion

Hopefully, this article will be beneficial to C++ programmers hoping to get more performance from their code. Unlike many other high-level languages, C++ requires more programmer effort and has a much steeper learning curve, but it is often easier to wring more performance out of the language, which is what its designers intended. This is probably why it is still the most widely used language for game development and will continue to be so for quite some time.

The reader is encouraged to check out the bibliography for this article, which lists sources containing many more useful techniques for dealing with virtual functions.

Happy Coding ,
Gabriel T. Delarosa
May 17, 2010

Bibliography:

Llopis, Noel, "C++ for Game Programmers" (Charles River Media, 2003, ISBN 1-58450-227-4)

Hyde, Randall, "Write Great Code: Understanding the Machine" (No Starch Press, 2004, ISBN 1-59327-003-8)

http://aigamedev.com/open/highlights/optimize-virtual-functions/
