Chapter 4.13. Creating a Multi-Threaded Actor-Based Architecture Using Intel® Threading Building Blocks

Robert Jay Gould, Square-Enix

With the next generation of consoles and personal computers having dozens of smaller processing cores, developers will have to redesign the architecture of their game engines to take advantage of this processing power. In some areas, such as graphics and physics, various methods to achieve higher performance already exist. In general, there has been less success in adapting higher-level AI- and gameplay-related systems to massively multi-cored environments.

This gem presents one architecture and implementation that can be used for AI and gameplay systems capable of scaling through high concurrency. The proposed architecture is a multi-threaded message passing actor architecture, which uses engineering tradeoffs similar to those of the Erlang language in order to achieve its concurrency and scalability. The implementation is in C++, using Intel’s Threading Building Blocks (TBB) for high-performance concurrency and Lua to create a friendly programming interface that hides the typical complexity associated with message-passing systems so that designers and gameplay programmers can get their work done productively.

Introduction

When striving for performance in multi-cored environments, the only solution is to increase the concurrency (amount of work that can be done in parallel) to leverage the system’s processing power. Thus, finding concurrency patterns in gameplay systems should be the first goal when designing scalable architectures.

Finding Concurrency in Games

Games are composed of several systems, and although each system might have its own peculiarities, they can be broadly grouped by considering which concurrent, or parallel, programming patterns [Mattson04] best suit them. For example, systems like graphics and physics that are almost entirely data-driven can be easily parallelized by applying data decomposition patterns and using implicit parallelization and referential transparency techniques. On the other hand, gameplay systems are primarily event-driven. If one looks at event handling as creating tasks to process events, task decomposition patterns can be applied to gameplay systems to increase their concurrency.

Task Decomposition

In simple terms, task decomposition is about breaking down the useful work of a system into atomic tasks that can be processed concurrently without compromising the state of the system. In games, we usually have no problem dividing work into tasks, as this is already part of most game architectures today. The tricky part is fulfilling the "without compromising the state" requirement. So finding ways in which state can be effectively divided and protected is the key to achieving task decomposition in the event-driven parts of our game engines.

State Protection Strategies

Perhaps the most familiar way for game programmers to protect the state of a system is through lock-based synchronization. As most programmers also know, lock-based systems can grow in complexity quite easily, and avoiding deadlocks, livelocks, data races, and convoying can consume a large chunk of precious development time. Besides the complexity and instruction overhead, lock-based synchronization also forces a non-negligible degree of serialization on processing, meaning that as the number of processing cores increases, scalability levels out and provides diminishing returns. Of course, the ideal alternative to lock-based synchronization is to use lock-free algorithms, but besides their implementations being devilishly difficult even for algorithm experts, in many cases there are simply no known lock-free algorithms for a given problem.

Other common techniques attack the problem through memory policies, as opposed to the processing policies used by the previous two alternatives. One such technique, generally applicable to any sort of system, is to use immutable data, as pure functional languages do. The benefit of immutable data is that state cannot be compromised; instead, state changes are implemented through data copying and garbage collection. Unfortunately, in systems that require lots of data and have rather tight memory constraints, such as games, all this redundancy can become an issue in itself. One solution that reduces this redundancy is transactional memory. It consists of maintaining metadata about the protected state, making it possible to determine when conflicts arise and to roll back and retry conflicting operations.

The strength of transactional memory is that unlike locks, it provides increasing returns as concurrency increases, so one day transactional memory may be the easiest solution for high throughput systems, such as games [Sweeney08]. However, we are limited to software transactional memory (STM), as there is no hardware support for these features. STM’s current downsides are mostly due to its experimental nature, which requires experimental compiler extensions and proprietary run-time libraries. It also has a large memory and instruction overhead due to the current bookkeeping algorithms, meaning that its performance penalties typically outweigh its benefits when shared state contention levels are low [Cascaval08], as they typically are in games.

Finally, one more solution—and the one used by the architecture presented in this gem—is the shared nothing [Stonebraker86] approach, where state cannot be compromised because it is not shared. This is typically achieved by having each concurrent process be completely responsible for modifying its own data through strong encapsulation. For this approach to scale on multi-cored environments, implementing lightweight processing units that can work concurrently is the way to go, and these techniques can be found in actor-based programming.

Actor-Based Programming

Actor-based programming is a programming paradigm in which the basic programming and processing unit is the actor. Actors are similar to objects in that they encapsulate state, but unlike typical objects they are also capable of processing information on their own, sort of like a cross between an object and a thread. Another difference between C++ style objects and actors is that instead of relying on method invocation for interaction, actors use message passing for communication.
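To make this encapsulation concrete, here is a deliberately simplified, single-threaded C++ sketch; the CounterActor class and its message strings are invented for this example, and a real actor engine would drain many mailboxes concurrently on worker threads:

```cpp
#include <cassert>
#include <queue>
#include <string>

// A minimal actor: private state plus a mailbox. The state is only ever
// touched from the actor's own processing step, never from the outside.
class CounterActor {
public:
    // Other actors interact by sending messages, not by calling methods
    // that mutate state directly.
    void send(const std::string& msg) { mailbox.push(msg); }

    // Drain the mailbox; this is the only code path that mutates `count`.
    void process() {
        while (!mailbox.empty()) {
            const std::string msg = mailbox.front();
            mailbox.pop();
            receive(msg);
        }
    }

    int value() const { return count; }

private:
    void receive(const std::string& msg) {
        if (msg == "increment") ++count;
        else if (msg == "reset") count = 0;
        // unknown messages are simply ignored in this sketch
    }

    std::queue<std::string> mailbox; // messages, not method invocations
    int count = 0;                   // encapsulated state
};
```

The important property is that `count` has a single owner: no outside code can compromise it, which is exactly what makes actors amenable to concurrent processing.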

Unfortunately, many game programmers associate message passing with the typical centralized message-pump design commonly used in games (as if the entire game were a single actor). Message passing is also associated with lots of boilerplate code to define and handle all the messages, but these are just artifacts of poor interface design and C++'s syntax. In fact, message-passing interfaces can be quite elegant without any tedious boilerplate code, with Smalltalk and Objective-C being good examples of elegant message-passing interfaces.

The actor-based architecture in this gem is in great part inspired by the Erlang language, which is the most successful actor-oriented language in use and also one of the most scalable languages in general. Erlang is a mature language that was designed by Joe Armstrong while working at Ericsson in 1986 to develop robust and scalable applications to handle telecommunications systems [Armstrong07]. Nowadays, Erlang is used by many telecoms around the world to power their telephony systems, and recently Erlang has been gaining ground in scalable commercial web applications, such as Amazon’s EC2 and SimpleDB platforms and Facebook’s chat system. Erlang’s younger sibling and close competitor, Scala, which runs on the Java Virtual Machine, is used to provide the scalability behind services such as Twitter, demonstrating the power of actor-based designs.

One critical diverging engineering point between Erlang and the architecture in this gem is that Erlang is optimized for scaling in highly distributed environments, meaning that it has efficient IO handling as one of its central pillars. The actor-based architecture in this gem is optimized for running in highly multi-cored environments, focusing on efficient use of a single machine’s multiple cores, foregoing IO-based considerations almost entirely.

Implementing an Actor Engine

The first concern in implementing an actor engine is supporting each actor's ability to process its own work individually and concurrently. The simplest, yet least scalable, solution is to provide an individual physical thread (or even OS-level process) for each actor. Unfortunately, that means that as the number of actors increases, the memory and processing overhead due to context switches increases, and scalability will level off and even decrease. The better solution is to use something lighter than threads, as Erlang does, so we can support thousands of concurrent actors with near-linear scalability. In Erlang, this primitive processing unit is simply called a process, a lightweight context similar in implementation to a green thread or a coroutine. The actor engine herein uses Intel's TBB library, and its task-processing algorithms, to meet the parallel processing needs of its actors.

Intel's TBB is an open-source threading library that provides a platform- and compiler-independent, production-ready foundation for building multi-threaded applications, requiring only standard C++ support from the compiler. It includes several high-level tools, such as concurrent algorithms like tbb::parallel_for, and several thread-safe containers, like tbb::concurrent_hash_map and tbb::concurrent_queue. There are also lower-level tools, such as scalable memory allocators and a scalable task scheduler, tbb::task_scheduler, that is used to power all of TBB's higher-level concurrent algorithms. It is these lower-level task scheduler components that are used to construct the processing workhorse for the actor engine of this gem.

The Task Scheduler

The task scheduler's main job is to effectively map its task-based workload onto the resources available to it in a way that avoids unnecessary context switching, thus providing a scalable solution for task-based concurrency. Yet its performance benefits don't stop there. Unlike most simple schedulers, the tbb::task_scheduler isn't a fair scheduler using a round-robin-like algorithm. Instead, it is implemented as a finely tuned work-stealing scheduler [Intel09] that schedules tasks in a fashion that decreases cache misses and memory thrashing. It does this by first working on tasks that are hot in the cache and then working its way onto colder tasks as necessary. This cache-friendly scheduling also means that the task scheduler resembles a LIFO-like processor more than a typical FIFO-like processor.

What this means to our architecture, besides high performance processing, is that it doesn’t impose any ordering restrictions on message processing. Although this might sound strange at first, allowing messages to be handled out of order is just what Erlang’s message passing algorithm does, because this is an important area that is best left open for scalability optimizations.

The Message Class

With the processing implementation decided, we’ll quickly look at the messages that will be used to allow actors to interact. This, however, is actually just a simple implementation detail, and the precise type of a message is probably best left for each system to define based on the system’s other requirements. For this reason, the message type that actors handle is simply defined as a template argument in our implementation.
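For illustration, this template-argument approach might look something like the following sketch; the class and member names here are placeholders, not the gem's actual interface:

```cpp
#include <cassert>
#include <cstddef>
#include <queue>

// The actor code never commits to a concrete message type; each system
// instantiates it with whatever message representation suits it best.
template <typename MESSAGE_TYPE>
class BasicActor {
public:
    void inbox(const MESSAGE_TYPE& msg) { messageQueue.push(msg); }
    std::size_t pending() const { return messageQueue.size(); }
protected:
    std::queue<MESSAGE_TYPE> messageQueue;
};

// One system might use a small struct as its message type...
struct GameplayMsg { int opcode; int argument; };

// ...and instantiate the very same actor code with it.
using GameplayActor = BasicActor<GameplayMsg>;
```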

The Actor Class

In implementing the actor class, we can divide its functionality into four fundamental parts—its internal state, a message queue, a message processor, and the message handling. The message queue and message processing are the critical core features that are in the domain of the system programmer to implement efficiently to provide good scalability with little overhead, and as such, they are implemented in the base actor class. The internal state and the message handling are the extensible features of the actor, so they are derived by subclasses of the base actor class.

The Message Queue

As one of the core features of the actor class, the message queue is important because it is the only contention point in the actor architecture where several threads may interact with the same data. This implies that its performance directly affects the overall performance of the actor and the scalability of the system at large. The obvious candidate for our message queue's container is tbb::concurrent_queue. It provides a thread-safe queue that allows several actors to add messages to it concurrently, with relatively low overhead compared to a std::queue with a big lock around it.

On this subject, an optimization not present in the sample code but likely of interest to system programmers is that tbb::concurrent_queue is a thread-safe multi-producer multi-consumer (MPMC) queue, which is safe when several producers and consumers push and pop on the queue concurrently. However, our actors only need a multi-producer single-consumer (MPSC) queue, because the only consumer of the queue is the actor itself, so we are paying for more than we need by using tbb::concurrent_queue. In tests and benchmarks using an excellent lock-free MPSC queue algorithm [V'jukov08], the system's performance increased by as much as 10 percent under heavy load with high contention rates, showing just how important the performance of the message queue is to the actor engine.
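To give a feel for this family of queues, here is a sketch of a non-intrusive MPSC queue in the spirit of [V'jukov08], built only on std::atomic. It is an illustration for study rather than the benchmarked implementation: producers may call push() from any thread, but only the single consumer (the owning actor) may ever call try_pop().

```cpp
#include <atomic>
#include <cassert>
#include <utility>

template <typename T>
class MpscQueue {
    struct Node {
        std::atomic<Node*> next{nullptr};
        T value;
    };
    std::atomic<Node*> head; // most recently pushed node (producer side)
    Node* tail;              // stub / last consumed node (consumer side)
public:
    MpscQueue() : head(new Node), tail(head.load()) {}
    ~MpscQueue() {
        T ignored;
        while (try_pop(ignored)) {}
        delete tail; // remaining stub node
    }
    // Safe to call from many producer threads concurrently.
    void push(T v) {
        Node* n = new Node;
        n->value = std::move(v);
        // Publish in two steps: swing head, then link the old head to us.
        Node* prev = head.exchange(n, std::memory_order_acq_rel);
        prev->next.store(n, std::memory_order_release);
    }
    // Only the single consumer thread may call this.
    bool try_pop(T& out) {
        Node* next = tail->next.load(std::memory_order_acquire);
        if (!next) return false; // empty (or a push is mid-flight)
        out = std::move(next->value);
        delete tail; // old stub; `next` becomes the new stub
        tail = next;
        return true;
    }
};
```

The single-consumer restriction is what lets the pop side avoid contention entirely, which is where the savings over an MPMC queue come from.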

Actor Hierarchies

In this architecture, any number of root actors can be spawned, and all actors are allowed to spawn children, which become their responsibility. This means there can typically be many trees of actors (a forest) running concurrently. This is useful because it allows any particular actor hierarchies to be spawned or shut down independently, allowing particular subsystems to be restarted without affecting others. As long as two actors can understand each other’s messages, they can communicate, even if they belong to different hierarchies; however, actors within a single hierarchy will likely be processed in closer temporal proximity.

Listing 4.13.1. Actor construction

Actor::Actor(Actor* parent) : pParent(parent)
{
    if (pParent)
    {
        //Get root processing unit/task
        pRootTask = pParent->pRootTask;
    }
    else
    {
        //Become a new root processing unit/task
        pRootTask = new(tbb::task::allocate_root()) tbb::empty_task();
        pRootTask->set_ref_count(++childrenCounter);
    }
}

As seen in Listing 4.13.1, root actors keep circular references to their own TBB processing root task. This prevents the scheduler from cleaning up the task associated with each processing hierarchy when no work is left, allowing us to reuse the same root task for each actor tree indefinitely. Also of note is TBB's use of in-place object allocation using new(tbb::task::allocate_root()). This idiom avoids the overhead of creating tasks by recycling tasks from object pools behind the scenes. Both features serve to avoid memory management bottlenecks due to the large number of tasks that will be spawned during the engine's lifetime.

Message Processing

The message processing cycle is the heart of the actor engine, and most of the important implementation details are located in this section. Things will get a bit grittier from here on.

As in Erlang, message passing between actors in our design is totally asynchronous. The interactions among actors are entirely one-way affairs, not requiring any sort of handshake between actors. This allows the sender to continue with its own processing without having to wait for the recipient to handle the message. If we forced interactions to be synchronous by requiring handshakes, message passing would not be as scalable as Erlang's; instead, it would be similar to the message passing found in Objective-C or Smalltalk.

By decoupling actor interactions, the message handling can be decomposed into a great number of very finely grained discrete tasks. Theoretically, this allows the system to scale linearly as long as actors outnumber the number of processing cores. However, in reality, the creation and processing of tasks by the task scheduler has an overhead in itself. Because of this, TBB’s documentation recommends that the grain size of a task be between 100 and 100,000 instructions for the best performance [Intel09]. So unless most message handling is quite heavy, assigning one task per message can become quite wasteful. This issue requires optimizations to increase the total throughput of the system.

When an actor places a message into another actor’s inbox, the processing thread kick-starts the recipient actor, as shown in Listing 4.13.2. In situations where the sender will be waiting for a reply from the recipient, this optimization allows for reduced latency as the recipient’s task will be hot in the cache and favored by the task scheduler for being processed next. More importantly, this design also removes the need for having actors wastefully poll for work, making processing entirely event-driven.

Listing 4.13.2. Message passing

void Actor::inbox(Actor* sender, const MESSAGE_TYPE& msg)
{
    this->messageQueue.push(msg);
    this->tryProcessing(sender->GetProcessingTask());
}

void Actor::tryProcessing(tbb::task* processing_unit)
{
    //compare_and_swap returns the previous value, so we only
    //proceed if no other thread had already claimed processing
    if( !messageQueue.empty() &&
        !isProcessingNow.compare_and_swap(true, false))
    {
        //use a continuation task
        tbb::empty_task* continuation = new(
            pRootTask->allocate_continuation()) tbb::empty_task;
        this->pMessageProcessorTask = new(
            continuation->allocate_child()) MsgProcessor(this);
        continuation->set_ref_count(1);
        this->pRootTask->spawn(*this->pMessageProcessorTask);
    }
}

Also in Listing 4.13.2, you can see that TBB continuations are utilized. This allows more freedom to the scheduler for optimizing its own workflow by decoupling execution order. A detailed explanation of continuation-style programming can be found in [Dybvig03].

As mentioned earlier, to increase the workload of a single task over 100 instructions, an actor will consume the entirety of its message queue, as well as any messages that arrive while it is processing within the execution of a single task, as seen in Listing 4.13.3. This design dynamically eases the task-processing overhead as work increases.

Listing 4.13.3. Message consumption loop

void Actor::processAllMessages()
{
    isProcessingNow = true;
    Msg_t msg;
    while(messageQueue.try_pop(msg))
    {
        try
        {
            if(!this->receive(msg))
            {
                throw(messageHandlingError);
            }
        }
        catch(const std::exception& e)
        {
            sendException(pParent, e, msg);
        }
    }
    isProcessingNow = false;
}

Also of note is the error handling logic employed for message processing. In an actor-based architecture, when an actor encounters an exception or error it cannot handle, it doesn’t make sense for the error to bubble up through the call stack. Instead, it needs to bubble up through its actor hierarchy, forwarding the exception to its parent. Generally, when a parent receives an error from a child, it has few options because it can’t query or fix the child’s state directly. Typical error handling includes ignoring the message altogether, reporting the error to another actor, or as is common in Erlang, creating a fresh new child actor to replace the actor having problems. The functionality of the failing actor is thereby automatically reset to a clean state.
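A compressed sketch of this "replace the failing child" strategy follows. The Supervisor/Worker names and the direct call are hypothetical simplifications; in the real engine the error would travel up to the parent as a forwarded message, as described above.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>

// A child actor with its own encapsulated state.
class Worker {
public:
    int processed = 0;
    void handle(const std::string& msg) {
        if (msg == "bad")
            throw std::runtime_error("cannot handle: " + msg);
        ++processed;
    }
};

// The parent cannot inspect or repair the child's state directly;
// on an error it simply spawns a fresh replacement in a clean state.
class Supervisor {
public:
    std::unique_ptr<Worker> child{new Worker};
    int restarts = 0;

    void deliverToChild(const std::string& msg) {
        try {
            child->handle(msg);
        } catch (const std::exception&) {
            onChildError();
        }
    }
private:
    void onChildError() {
        child.reset(new Worker); // replace with a fresh child
        ++restarts;
    }
};
```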

Message Handling

Like actors in Erlang, our actors have a receive method, which can be seen in Listing 4.13.3. The receive method is called whenever a message needs handling; the actual implementation of this method is the responsibility of the derived actor classes. Typical implementations of receive consist of matching message signatures to specific message handling routines. This can be done in many ways, from a switch/case construct or a series of if/else statements to something more complex, such as recognizing message signatures, regular expression matching, hierarchical state machines, or even neural networks. A couple of implementation prototypes that might serve as inspiration can be found on the CD, but the design that will be the focus for the rest of this gem is the scripted actor, which matches message signatures directly to Lua functions registered with the actor. This straightforward design provides a simple, generic, boilerplate-free solution that gameplay programmers and designers can extend using Lua alone, building whatever class hierarchies or aggregations they need to get their work done without touching any of the library code directly.

As promised earlier in this gem, message-passing APIs can be friendly and don’t have to involve lots of boilerplate, as the Lua code in Listing 4.13.4 demonstrates. In fact, it looks pretty much like any ordinary Lua API.

Listing 4.13.4. Lua-based actor API

class "Player":inherit "Character"
{
    attack = function(target)
        target:mod_hp(math.random(5,10))
    end,
}

--Example use case
local player = getActor("player1", "Player")
local enemy = getActor("goblin01")
player:attack(enemy)

To provide a clean, scriptable API, getActor takes an actor name and returns an actor proxy, not an actual actor reference. Using its metatable's index logic, the proxy is capable of responding to any sort of function call. The proxy then converts these object-oriented function calls and their arguments into a message that gets sent to the C++ actor. Of course, simply allowing game code to call any method on the proxy with no error checking could make debugging quite complicated, but typically an unhandled message log combined with some class-based checking makes most debugging reasonable. To add the class-based message checking feature to a proxy, an optional second argument is passed in: the name of the class that should be used for verifying message calls, as seen in Listing 4.13.5.

Listing 4.13.5. Lua actor proxies

function getActor(address, classname)
    local class = findClass(classname)
    return setmetatable({__address=address, __class=class}, mt_actor)
end

mt_actor.__index = function(self, message)
    if rawget(self, "__class") and not self.__class[message] then
        error(("class [%s] can't handle message [%s]"):format(
              self.__class.name, message))
    else
        self.__message = message
        return send(self) --a C-function connected to the actor engine
    end
end

Message-Passing Patterns

When working with asynchronous actors and message passing, some issues may appear. As with other styles of programming, there are already several useful and tested programming patterns that can be applied to resolve most of these problems.

Querying for State and Promises

At times an actor may need to query the state of another actor, not just send it a message. Because actor message passing is asynchronous, this can be an issue. One solution is to add a way to force synchronous message passing when required, but that adds coupling to the system that could compromise its concurrency. Another solution, and the one commonly used by actor-based architectures, is to handle this scenario with a future or promise. Promises are implemented by sending, along with the message, the address of the calling actor and a continuation context to which the recipient responds, resuming the sender's context; in this implementation that is done with TBB continuations [Werth09]. In the proposed API, this is handled as seen in Listing 4.13.6, by calling the promise to block until the value is returned.

Listing 4.13.6. Using promises to query an actor for state

promiseHp = enemy:get_hp()
if promiseHp() > 50 then --invoking the promise returns its value
    player:run()
end
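The same query pattern can be sketched in C++ with the standard library's promise/future pair. This is an illustration of the idea only; the gem's implementation uses TBB continuations rather than std::future, and the names below are invented:

```cpp
#include <cassert>
#include <future>

// The sender holds a future; the recipient holds the matching promise and
// fulfills it whenever it gets around to processing the query message.
struct HpQuery {
    std::promise<int> reply;        // recipient's side
    std::future<int> promisedValue; // sender's side
    HpQuery() : promisedValue(reply.get_future()) {}
};

// Recipient side: eventually answer the query (possibly on another thread).
void answer(HpQuery& q, int hp) { q.reply.set_value(hp); }

// Sender side: redeeming the promise blocks until the value is available.
int redeem(HpQuery& q) { return q.promisedValue.get(); }
```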

Another type of promise is to obtain an actor reference from another actor, as in Listing 4.13.7. This style of promise is accomplished by chaining proxy promises.

Listing 4.13.7. Using promises to obtain actors in second degree

player.healClosest = function(self)
    local map = getActor("map")
    local closestActor =
        map:findClosestActor(self:get_position())
    closestActor:heal(100)
end

Sequential Message Processing

Another special case that requires consideration because the actor-based system is asynchronous is how to handle messages when they only make sense in a certain order. For example, when opening a door, it is required to first insert a key and then push it open. However, there is no assurance that the “insert key” message will actually arrive before the “push door” message, even if they were sent in the correct order. The typical actor-based solution is to use a sequencer actor that works in conjunction with the door. The job of the sequencer is to queue and sort messages according to some internal logic and then forward them in an appropriate order as they become available. In our example, the sequencer would not send the “push door” message to the door before it has received the “insert key” message. Although sequencers tend to be more effective than direct coupling, they do introduce a degree of serialization to the code, so they should be used only where truly necessary.
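A sketch of such a sequencer for the door example might look as follows; this is a deliberately simple, single-threaded illustration, and the class and message strings are invented here:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Buffers out-of-order messages and forwards them to the door only once
// every prerequisite in `requiredOrder` has been seen, in order.
class DoorSequencer {
public:
    std::vector<std::string> forwarded; // messages delivered to the door

    void inbox(const std::string& msg) {
        pending.push_back(msg);
        flush();
    }

private:
    std::vector<std::string> requiredOrder{"insert key", "push door"};
    std::size_t nextIndex = 0;          // next prerequisite to forward
    std::vector<std::string> pending;   // buffered out-of-order messages

    // Forward as many buffered messages as now fit the required order.
    void flush() {
        bool progressed = true;
        while (progressed && nextIndex < requiredOrder.size()) {
            progressed = false;
            for (auto it = pending.begin(); it != pending.end(); ++it) {
                if (*it == requiredOrder[nextIndex]) {
                    forwarded.push_back(*it);
                    pending.erase(it);
                    ++nextIndex;
                    progressed = true;
                    break;
                }
            }
        }
    }
};
```

Note how the "push door" message is held back until "insert key" arrives, which is exactly the serialization the sequencer deliberately introduces.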

Message Epochs

One more common scenario is that it is possible for an actor to receive a message that, say, lowers its HP below 0, although a heal message had actually been sent before the damage message was issued. In most cases within a single atomic gameplay frame, the actual production order of messages should not be important, making this sort of issue actually more like a quantum physics conundrum than a real gameplay issue. This means that generally, this sort of event can be ignored with no detriment to the game. Nevertheless, when disambiguation is required, an easy solution is to use epochs [Solworth92] or timestamps to determine which message was sent first.
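As a sketch of the epoch approach, suppose each message carries the epoch in which it was produced; the recipient can then apply effects in production order rather than arrival order. The names and the simple death rule below are illustrative:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Each message records the epoch (or timestamp) at which it was produced.
struct TimestampedMsg {
    unsigned epoch;
    int hpDelta;
};

// Returns true if HP ever drops to zero or below while applying messages.
// With useEpochs set, messages are first sorted into production order.
bool dies(int startHp, std::vector<TimestampedMsg> msgs, bool useEpochs)
{
    if (useEpochs) {
        std::sort(msgs.begin(), msgs.end(),
                  [](const TimestampedMsg& a, const TimestampedMsg& b) {
                      return a.epoch < b.epoch;
                  });
    }
    int hp = startHp;
    for (const auto& m : msgs) {
        hp += m.hpDelta;
        if (hp <= 0) return true; // dropped to zero at some point
    }
    return false;
}
```

Processed in arrival order, a late-arriving heal cannot save the actor; sorted by epoch, the heal lands first, as it was produced first.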

Conclusion

This gem reviewed the requirements and alternatives to build a highly scalable architecture and went into the details of implementing one viable alternative, based on the actor model and a shared nothing policy using Intel’s Threading Building Blocks. On the CD there is a reference implementation in the form of an actor-based Lua console, along with sample scripts for experimentation with the concepts presented in this gem.

References
