MaxCore™: All your cores belong to us!

February 11, 2016 Matt Shaw

Virtual Reality is ushering in an era of amazing new experiences, but it also introduces a variety of new challenges that require innovative solutions. One such challenge is the “power hungry” nature of Virtual Reality applications. VR pushes the envelope in graphics and simulation, so much so that the hardware requirements for good VR experiences have recently become a hot topic. These steep requirements put a ton of pressure on the software behind the VR experiences to fully utilize the available hardware, including making better use of multi-core CPUs.

MaxCore™ is the term MaxPlay uses to denote that our engine can fully leverage all of the CPU cores on your device, whether it be a phone, PC, or game console. You might be surprised to learn that many of the most popular game engines cannot do this, which means much of your CPU’s potential may be going to waste. That’s a shame, given that maximum performance is needed for the best, richest experience, especially for VR, where every frame counts.

The Rise of Multi-core!

Before we get into how our engine works, let’s talk about why chip manufacturers are increasing the number of cores in modern processors.

In the ongoing race to make computers faster, transistor counts have multiplied and CPU clock speeds have steadily increased. Unfortunately, as more transistors are added and clock speeds rise, processors need more and more energy to operate and therefore generate more heat. Excess heat damages CPU components; you may have noticed that your phone can become uncomfortably warm when you play certain games. As processors speed up, thermal management becomes a huge issue: CPU manufacturers must balance clock speed and heat to keep the device within a safe operating range, also known as the thermal envelope.

The most significant trend to combat this problem is to put multiple CPU execution cores on one physical chip, each running at a lower clock speed. For workloads that can be split up, a CPU with two cores can execute tasks roughly as fast as a single core running at twice the speed, yet the dual-core CPU consumes less energy and therefore stays cooler.

Today, it’s almost impossible to buy a CPU with only one core. The iPhone 6 has a processor with 2 physical cores. High-end Android phones come with as many as 8 cores. Chip manufacturers are now producing 8-core processors for the consumer PC market. Even the PlayStation 4 has an 8-core processor; until recently developers only had access to 6 of the 8 cores, but now they can access 7.

However, to take full advantage of multiple cores, an effective multithreading software architecture is a requirement. Unfortunately, most game engines aren’t designed to handle the rapid proliferation of additional CPU cores.


Programming for multi-core:

As the saying goes, “there is no such thing as a free lunch,” and multithreading a program or engine to utilize multiple CPU cores requires careful thought. Public enemy number one is data contention: multiple CPU execution paths need access to the same data.

Here’s a relatively simple example. Suppose one CPU core writes a value into memory at literally the same time another core is trying to read that exact same value. The second core may read partially written, and therefore useless, data. This is called a race condition, and it can crash a program or lead to unexpected results.
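To make that concrete, here is a minimal C++ sketch of such a race (illustrative only, not MaxPlay engine code): one thread repeatedly writes a shared value while another reads it with no synchronization, which is undefined behavior in C++ and can yield torn or stale reads.

```cpp
#include <cstdint>
#include <thread>

static uint64_t shared_value = 0;   // plain variable, deliberately NOT atomic

int main() {
    std::thread writer([] {
        for (uint64_t i = 0; i < 1'000'000; ++i)
            shared_value = i;                 // unsynchronized write
    });
    std::thread reader([] {
        uint64_t seen = 0;
        for (int i = 0; i < 1'000'000; ++i)
            seen += shared_value;             // unsynchronized read: data race
        (void)seen;                           // keep the compiler from complaining
    });
    writer.join();
    reader.join();
}
```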

To protect data from simultaneous access, programmers employ synchronization objects such as mutexes. While synchronization works, it can leave cores waiting on one another, which leads to uneven performance and wasted time. Architectural designs that minimize the number of synchronizations required are the best, and this is what makes realizing a multi-core processor’s full potential difficult. If an application is not designed and implemented correctly to take advantage of multiple cores, it may in fact execute slower.
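Here is the same sketch with a mutex added as the synchronization object (again, illustrative only): the race disappears, but every access now pays for a lock, and whichever thread loses the contest simply waits.

```cpp
#include <cstdint>
#include <mutex>
#include <thread>

static std::mutex value_mutex;      // synchronization object guarding shared_value
static uint64_t shared_value = 0;

int main() {
    std::thread writer([] {
        for (uint64_t i = 0; i < 1'000'000; ++i) {
            std::lock_guard<std::mutex> lock(value_mutex);
            shared_value = i;                 // exclusive write
        }
    });
    std::thread reader([] {
        uint64_t seen = 0;
        for (int i = 0; i < 1'000'000; ++i) {
            std::lock_guard<std::mutex> lock(value_mutex);
            seen += shared_value;             // exclusive read
        }
        (void)seen;
    });
    writer.join();
    reader.join();
}
```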

Performance in a game engine is all about gathering input, simulating the world, and rendering. The faster and more evenly you can do that, the better the frame rate and user experience will be. Achieving 90 frames per second requires gathering input, simulating the world, and rendering it in roughly 11 thousandths of a second. For VR, rendering has to be done twice, once for each eye. Eleven milliseconds is a very short amount of time, so game developers struggle to hold that 90 FPS frame rate, usually by reducing the quality of the content until it’s achieved. Reduced content quality equates to a compromised experience.
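For the curious, the budget arithmetic in code form (a trivial illustration, not engine code):

```cpp
#include <cstdio>

int main() {
    const double target_fps = 90.0;
    const double frame_budget_ms = 1000.0 / target_fps;   // ~11.1 ms per frame
    // Input, simulation, and two renders (one per eye) must all fit in this window.
    std::printf("Frame budget at %.0f FPS: %.1f ms\n", target_fps, frame_budget_ms);
}
```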

Many engines have been retrofitted to support multiple cores through a few design modifications. This improves performance on a few cores, but data contention limits the potential gains, and the approach doesn’t scale well beyond just a few CPU cores.


Retrofitting for Multi-core:

Most older game engine architectures have been retrofitted with various forms of functional multithreading.

The first change is usually moving the renderer onto its own thread. Next, areas that lend themselves more easily to parallelization, such as physics and AI, are identified.
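A minimal sketch of that first step, assuming a simple command queue between the simulation thread and a dedicated render thread (the names and structure here are illustrative, not MaxPlay’s API):

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

struct RenderCommand { std::string description; };

std::queue<RenderCommand> command_queue;
std::mutex queue_mutex;
std::condition_variable queue_cv;
bool simulation_done = false;

// The render thread drains commands produced by the simulation thread.
void render_thread_main() {
    for (;;) {
        std::unique_lock<std::mutex> lock(queue_mutex);
        queue_cv.wait(lock, [] { return simulation_done || !command_queue.empty(); });
        if (command_queue.empty()) return;    // nothing left and the simulation is finished
        RenderCommand cmd = command_queue.front();
        command_queue.pop();
        lock.unlock();
        std::printf("render: %s\n", cmd.description.c_str());  // stand-in for real drawing
    }
}

int main() {
    std::thread renderer(render_thread_main);
    for (int frame = 0; frame < 3; ++frame) {                  // "simulation" thread
        std::lock_guard<std::mutex> lock(queue_mutex);
        command_queue.push({"frame " + std::to_string(frame)});
        queue_cv.notify_one();
    }
    {
        std::lock_guard<std::mutex> lock(queue_mutex);
        simulation_done = true;
    }
    queue_cv.notify_one();
    renderer.join();
}
```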

Instead of looping over all of the physics and AI objects in a linear fashion, the work is distributed across multiple threads running on multiple CPU cores. When that work is done, the program returns to linear execution. More advanced systems use a job model that lets idle threads pick up work as it becomes available. In practice this type of multithreading can realize reasonable gains, but large amounts of CPU time within the frame are still wasted waiting for synchronization, or because one or more cores have no work to do. Hand tuning may work for a particular product, but it doesn’t load balance itself well automatically. The result is uneven CPU usage, as demonstrated below. Even the game community at large has recognized that most games built with older game engines won’t show much improvement beyond four cores.
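As a rough illustration of that fork-and-join pattern (a generic sketch, not MaxPlay’s job system), here the physics update is chopped into chunks, handed to worker threads, and then execution rejoins the main thread:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

struct PhysicsObject { float position = 0.0f; float velocity = 1.0f; };

void update_range(std::vector<PhysicsObject>& objects,
                  std::size_t begin, std::size_t end, float dt) {
    for (std::size_t i = begin; i < end; ++i)
        objects[i].position += objects[i].velocity * dt;   // independent per object
}

void parallel_update(std::vector<PhysicsObject>& objects, float dt) {
    const unsigned worker_count = std::max(1u, std::thread::hardware_concurrency());
    const std::size_t chunk = (objects.size() + worker_count - 1) / worker_count;

    std::vector<std::thread> workers;
    for (unsigned w = 0; w < worker_count; ++w) {
        const std::size_t begin = w * chunk;
        const std::size_t end = std::min(objects.size(), begin + chunk);
        if (begin >= end) break;
        workers.emplace_back(update_range, std::ref(objects), begin, end, dt);
    }
    for (auto& t : workers) t.join();   // barrier: back to linear execution
}

int main() {
    std::vector<PhysicsObject> objects(10'000);
    parallel_update(objects, 1.0f / 90.0f);
}
```

If the chunks take uneven amounts of time, some workers sit idle at the join, which is exactly the wasted CPU time described above.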


Uneven CPU usage in today’s popular game engines:

[Image: Performance.jpg]

Applications, or engines in our case, need to be designed to minimize data contention. It’s often impractical to retrofit an existing architecture to take full advantage of multi-core. The best way to fully utilize multiple CPU cores is to design for them from the ground up, but that is expensive and time consuming.


Designing for Multi-Core, MaxPlay MaxCore:

At MaxPlay we designed our engine from the ground up to scale for multi-core architectures. Looking at current CPU architectures and future roadmaps, it was clear we needed a system that would address several areas of concern: it must be scalable, have predictable load balancing, be extensible, and maintain flow control over the simulation across any number of CPU cores. We call this Multi-stage Data Parallel Simulation with an asynchronous renderer.

The simulation is structured to scale across the CPU cores, with one core reserved for rendering. To achieve this, the simulation is broken into tasks that can each be processed discretely, in any order. In practice this means the work is split into multiple stages: a read stage, a decision-making stage, and a write stage.

Additionally, you may need a broadcast stage for communication among objects.

We provide a solid distribution function, per stage of execution, that results in automatic load balancing within a given stage over time: as cores finish, they are given more work until the stage is complete. We understand that some teams will want to extend or replace our distribution functions with ones that fit their data and task model more closely, so we provide easy-to-use mechanisms to monitor and tailor the distribution functions.
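To show the general shape of such a multi-stage, data-parallel update, here is a minimal sketch under the assumptions above. It is a generic illustration, not MaxPlay’s actual distribution function: within each stage, workers pull small batches from a shared atomic counter, so cores that finish early keep taking work, and each stage ends with a barrier before the next begins.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

struct Object {
    float position = 0.0f;
    float observed_neighbor = 0.0f;   // filled in the read stage
    float planned_move = 0.0f;        // filled in the decision stage
};

// Run one stage over all objects, distributing batches dynamically across workers.
template <typename StageFn>
void run_stage(std::vector<Object>& objects, unsigned worker_count, StageFn stage) {
    std::atomic<std::size_t> next{0};
    const std::size_t batch = 64;                       // work granule per grab
    auto worker = [&] {
        for (;;) {
            const std::size_t begin = next.fetch_add(batch);
            if (begin >= objects.size()) return;        // stage exhausted
            const std::size_t end = std::min(objects.size(), begin + batch);
            for (std::size_t i = begin; i < end; ++i) stage(objects, i);
        }
    };
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < worker_count; ++w) workers.emplace_back(worker);
    for (auto& t : workers) t.join();                   // barrier between stages
}

int main() {
    std::vector<Object> objects(100'000);
    const unsigned hw = std::thread::hardware_concurrency();
    const unsigned workers = hw > 1 ? hw - 1 : 1;       // leave one core for the renderer

    // Read stage: only reads shared state, writes private per-object fields.
    run_stage(objects, workers, [](std::vector<Object>& o, std::size_t i) {
        o[i].observed_neighbor = o[(i + 1) % o.size()].position;
    });
    // Decision stage: pure computation on data gathered during the read stage.
    run_stage(objects, workers, [](std::vector<Object>& o, std::size_t i) {
        o[i].planned_move = 0.5f * (o[i].observed_neighbor - o[i].position);
    });
    // Write stage: each object commits only its own state, so there is no contention.
    run_stage(objects, workers, [](std::vector<Object>& o, std::size_t i) {
        o[i].position += o[i].planned_move;
    });
}
```

Because each stage either reads shared state or writes only private state, the stages need no locks at all; the only synchronization is the barrier between stages.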


A Multi-stage Data Parallel Simulation (with an asynchronous renderer):

[Image: CoreDiagram.jpg]

This design should allow a large-scale simulation to run alongside an asynchronous renderer. To illustrate the potential performance gains, we’ve provided some data points from tests using the MaxPlay MaxCore runtime engine.

More efficient CPU utilization resulting from Multi-stage Data Parallel Simulation:

[Image: Performance2.jpg]

Above you can see a six-core processor (with 6 hyperthreads) running a computationally intensive flocking simulation. See the flocking simulation performance video here. Test results on a variety of hardware platforms indicate significant performance opportunities compared to traditional functional-multithreading models. These are thousands of full game objects running a complete flocking simulation.

[Image: Graph.jpg]

In many cases, we’ve found that employing this technique yields massive gains in performance, allowing developers to pack more content into their experiences. Indications from hardware manufacturers are that the number of cores will continue to multiply, so it’s important to consider a solution that will scale to 12, 16, and even 24 cores!

For those wondering about systems with the DirectX 12 and Vulkan graphics APIs, we are well positioned to send display lists to the renderer, and to use the data it returns, from any of our stages as needed.

I hope this article has been helpful in explaining MaxPlay’s MaxCore™ multi-core architecture. Our goal is to create a solution that is well suited to today’s needs in gaming and VR by providing a high-performance framework for superior products on devices with two or more CPU cores.

MaxPlay’s MaxCore™ runtime is just one component of the MaxPlay Game Development Suite. We have lots more to talk about in future blogs, such as our MaxCore™ Immediate Mode Insertion and all of the optimizations we are making in our rendering for VR. Stay tuned.