
Enhancing VR Immersion with the CPU in Star Trek™: Bridge Crew



Introduction

Traditionally, the CPU has not been considered a large contributor to immersive and visually striking game scenes. In the past, a few games exposed settings that let users adjust CPU usage, but many developers believe it's more trouble than it's worth to implement multiple tiers of CPU-based systems for different hardware levels. With this article, and the awesome work showcased in Star Trek™: Bridge Crew* in partnership with Ubisoft's Red Storm Entertainment*, we're aiming to fix that misconception. Virtual reality (VR) is a segment where the combination of environmental interaction, enhanced physics simulation, and destruction can become the icing on the cake that keeps a player in your game and wanting more. Given the low-end hardware specification required for Oculus*, it is more important than ever to move past the idea of tailoring CPU work to the minimum specification across the board. Leveraging available system resources to enhance dynamism and immersion will help you create your ideal game while keeping it accessible to as many players as possible, and we've made doing so easier than ever.

This article will take you through each of the CPU-intensive features implemented in Star Trek™: Bridge Crew*, with instructions on using the systems they've been built upon. It is followed by a brief section on determining how much CPU work is too much for each performance tier. The final section shows how to easily set up CPU performance categories that auto-detect where your end user's hardware level sits.

Star Trek™: Bridge Crew* was built using Unity*, which will be the focus of this article, but all the concepts apply to other engines as well.

Check out the following video to see side-by-side comparisons of the game running these effects.

CPU Intensive Features in Star Trek™: Bridge Crew*

Bridge Damage – Combination of Physics Particles, Rigidbody Physics, and Realtime Global Illumination (GI)

Overview

The bridge of the USS Aegis is one of the main focal points of the game. Virtually all gameplay requires the player to be on the bridge, so it made sense to apply the bulk of the CPU work there to give the player the most bang for their buck. The most time was spent improving the bridge's various destruction sequences, adding intensity to the scene. For example, when the bridge is damaged, big set pieces fly off; sparks ricochet off walls and floors; and fires spring up, throwing a glow on surrounding set pieces not in direct view of the light source.

What Makes it CPU Intensive?

Applying damage to the bridge makes use of Unity's* realtime Rigidbody physics, Particle Systems with large numbers of small particles and collision support enabled, and realtime global illumination (GI) updates driven by the various fire and spark effects, all of which scale across available CPU cores. Various debris objects that use Rigidbody physics are spawned during damage events, and the particle counts are pushed significantly higher when the high-end effects are active. World collision support for the particles was added, along with collision primitives detailed enough to produce the desired bouncing and scattering behavior for the sparks. Some of Unity's* other CPU-based particle features were also added to enhance the scene, such as sub-emitters that add trails to fireballs and some sparks. The bridge damage particles were kept small in screen coverage to keep the GPU impact as low as possible while still achieving the desired look. When damage events occur, some of the lights and emissive surfaces flicker to simulate power interruption, and the GI is updated while the lights are flickering and while there are fires active on the bridge. Next, we'll go into each system and show how it can be leveraged separately.

Sparks

Overview

Unity’s* built-in Particle System component allows for a lot of variation in both aesthetics and behavior. It also just so happens that the built-in Particle System scales across available CPU cores very well under the hood. With the click of a button you can have your particle system collide and react to your environment, or if you want a more customized behavior, you can script the movement of each particle (more on this later). When using the built-in collision behavior shown below, the underlying engine will split the work up amongst available cores, allowing the system to go as wide as possible. Because of this, you can scale your particle counts based on the number of cores available, while also considering processor frequency and cache size. To activate collisions on your particles, simply go to the Particle System component of interest, check the Collision checkbox, and then select the desired settings associated with it.

There are quite a few options in the Collision settings group. The main decision is whether particles collide against the World or against a set of planes that you define. The former produces the most realistic simulation, since virtually every collider in your scene is considered in each particle update, but it comes with additional CPU cost. Games usually define a set of key planes that approximate the surrounding geometry to keep compute as low as possible and make room for other CPU-intensive effects. Which setting you choose depends entirely on the layout of your game and what you'd like to achieve visually. For example, the following defines three planes as colliders: a floor and two walls.

using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.Assertions;

public class ParticleSystemController : MonoBehaviour
{
    public static ParticleSystemController Singleton = null;

    ParticleSystem[] ParticleSystems;
    public Transform[] CollisionPlanes;

    void Awake()
    {
        if(!Singleton)
        {
            Singleton = this;
            Debug.Log("Creating ParticleSystemController");
        }
        else
        {
            Assert.IsNotNull(Singleton, "(Obj:" + gameObject.name + ") Only 1 instance of ParticleSystemController needed at once");
            DestroyImmediate(this);
        }
    }

    public void Init()
    {
        ParticleSystems = gameObject.GetComponentsInChildren<ParticleSystem>();
        Debug.Log("Initializing ParticleSystemController");
    }

    void Start()
    {
        SetCPULevel(CPUCapabilityManager.Singleton.CPUCapabilityLevel);
    }

    public void SetCPULevel(CPUCapabilityManager.SYSTEM_LEVELS sysLevel)
    {
        if (sysLevel == CPUCapabilityManager.SYSTEM_LEVELS.HIGH)
        {
            for (int i = 0; i < ParticleSystems.Length; i++)
            {
                var particleSysMain = ParticleSystems[i].main;
                var particleSysCollision = ParticleSystems[i].collision;
                var particleSysEmission = ParticleSystems[i].emission;
                particleSysEmission.rateOverTime = 400.0f;
                particleSysMain.maxParticles = 20000;
                particleSysCollision.enabled = true;
                particleSysCollision.type = ParticleSystemCollisionType.World;
            }
        }
        else if (sysLevel == CPUCapabilityManager.SYSTEM_LEVELS.MEDIUM)
        {
            for (int i = 0; i < ParticleSystems.Length; i++)
            {
                var particleSysMain = ParticleSystems[i].main;
                var particleSysCollision = ParticleSystems[i].collision;
                var particleSysEmission = ParticleSystems[i].emission;
                particleSysEmission.rateOverTime = 300.0f;
                particleSysMain.maxParticles = 10000;
                particleSysCollision.enabled = true;
                particleSysCollision.type = ParticleSystemCollisionType.World;
            }
        }
        else if (sysLevel == CPUCapabilityManager.SYSTEM_LEVELS.LOW)
        {
            for (int i = 0; i < ParticleSystems.Length; i++)
            {
                var particleSysMain = ParticleSystems[i].main;
                var particleSysCollision = ParticleSystems[i].collision;
                var particleSysEmission = ParticleSystems[i].emission;
                particleSysEmission.rateOverTime = 200.0f;
                particleSysMain.maxParticles = 5000;
                particleSysCollision.enabled = true;
                particleSysCollision.type = ParticleSystemCollisionType.Planes;
                for (int j = 0; j < CollisionPlanes.Length; j++)
                {
                    particleSysCollision.SetPlane(j, CollisionPlanes[j]);
                }
            }
        }
        else if (sysLevel == CPUCapabilityManager.SYSTEM_LEVELS.OFF)
        {
            for (int i = 0; i < ParticleSystems.Length; i++)
            {
                var particleSysMain = ParticleSystems[i].main;
                var particleSysCollision = ParticleSystems[i].collision;
                var particleSysEmission = ParticleSystems[i].emission;
                particleSysEmission.rateOverTime = 100.0f;
                particleSysMain.maxParticles = 3000;
                particleSysCollision.enabled = false;
            }
        }
    }
}

See a more optimized version in the CPUCapabilityTester sample.

Realtime Global Illumination (GI)

Overview

Realtime GI is the simulation of light rays bouncing within a scene and indirectly illuminating objects. This feature was something the team really wanted to leverage because the big window at the front of the Aegis would allow for astral bodies and damage effects to update the interior of the bridge. Moving the Aegis in front of a massive sun or nebula changes the appearance of the bridge to reflect the incoming light, increasing immersion by giving the scene a cohesive look and making the vistas feel much more real.

What Makes it CPU Intensive?

Unity’s* realtime GI is computed heavily on the CPU and leverages a percentage of the available cores depending on the fidelity desired.

Is it Built into Unity*?

Yes. When the realtime GI effects are enabled, the application uses the highest CPU usage setting Unity* allows with an immediate update rate to get the best results.

How it’s Done

To enable this effect, check the Realtime Lighting checkbox in the Lighting window (Window > Lighting). (Note: Editor performance settings for Realtime GI are hidden in recent versions of Unity* and handled under the hood. Scripted update settings are still available; see the sample for details.) In older versions of Unity*, check the Precomputed Realtime GI checkbox (still within Window > Lighting). There are two settings that heavily affect CPU usage: Realtime Resolution and CPU Usage.

  • Realtime Resolution determines how many texels per unit should be computed. Unity* published a tutorial that goes into detail on how to properly set this value. A useful rule of thumb is that visually rich indoor scenes require more texels per unit to achieve as much realism as possible. In large outdoor scenes, indirect lighting transitions are not as noticeable, allowing the compute power to be spent elsewhere.
  • CPU Usage determines how many of the engine’s available worker threads will be leveraged for the realtime GI computation. It is best practice to determine the amount of CPU power available on various system levels and set this accordingly. For lower-end systems it’s best to keep this low/medium; for higher-end systems it’s better to use high or unlimited. Descriptions of these settings can be found in the Unity* documentation shipped with versions that expose them.


Settings in Unity* 5.6.1f1


Settings in older versions of Unity*

using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.Assertions;

public class GIController : MonoBehaviour {

    public static GIController Singleton = null;

    void Awake()
    {
        if (!Singleton)
        {
            Singleton = this;
            Debug.Log("Creating GIController");
        }
        else
        {
            Assert.IsNotNull(Singleton, "(Obj:" + gameObject.name + ") Only 1 instance of GIController needed at once");
            DestroyImmediate(this);
        }
    }

    public void Init()
    {
        Debug.Log("Initializing GIController");
    }

    void Start () {
        SetCPULevel(CPUCapabilityManager.Singleton.CPUCapabilityLevel);
    }

    public void SetCPULevel(CPUCapabilityManager.SYSTEM_LEVELS sysLevel)
    {
        if (sysLevel == CPUCapabilityManager.SYSTEM_LEVELS.HIGH)
        {
            DynamicGI.updateThreshold = 0;
        }
        else if (sysLevel == CPUCapabilityManager.SYSTEM_LEVELS.MEDIUM)
        {
            DynamicGI.updateThreshold = 25;
        }
        else if (sysLevel == CPUCapabilityManager.SYSTEM_LEVELS.LOW)
        {
            DynamicGI.updateThreshold = 50;
        }
        else if (sysLevel == CPUCapabilityManager.SYSTEM_LEVELS.OFF)
        {
            DynamicGI.updateThreshold = 100;
        }
        Debug.Log("(" + gameObject.name + ") System capability set to: " + CPUCapabilityManager.Singleton.CPUCapabilityLevel + ", so setting GI update threshold to: " + DynamicGI.updateThreshold);
    }
}
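
The bridge damage description above also mentions emissive surfaces flickering during power interruptions and driving GI updates. Red Storm's exact implementation isn't published, but the following minimal sketch, assuming a Renderer whose material contributes emissive light and a scene with realtime GI enabled, shows the general idea of flickering an emissive value and feeding it to the realtime GI system via DynamicGI.SetEmissive. The flicker curve and intensity here are placeholder values.

using UnityEngine;

public class EmissiveFlicker : MonoBehaviour
{
    public Renderer PanelRenderer;                      // emissive surface to flicker (assumed to exist in the scene)
    public Color BaseEmission = Color.white * 2.0f;     // placeholder full-power emission
    public float FlickerSpeed = 20.0f;

    void Update()
    {
        // Cheap pseudo-random flicker between zero and full emission
        float flicker = Mathf.PerlinNoise(Time.time * FlickerSpeed, 0.0f);
        Color currentEmission = BaseEmission * flicker;

        // Push the new emissive value into the realtime GI system so bounced light updates
        DynamicGI.SetEmissive(PanelRenderer, currentEmission);
    }
}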

Dynamic Asteroids

Overview

When the Aegis navigates asteroid fields, additional asteroids are generated outside the view frustum of the player and launched into view. These asteroids collide with existing in-place asteroids and kick off dust.

Many of the game's maps also contain asteroid field generators; these generators scatter large static asteroids within a cylindrical or spherical zone. When high-end CPU effects are enabled, these zones also place dynamic asteroids with Rigidbody physics at a certain distance from the ship while it's moving. This helps give the impression that the asteroid field is full of smaller fragments colliding with each other and with the larger asteroids. There is also a small chance that a dynamic asteroid will spawn with a velocity already applied, to keep things moving and the scene active. Finally, some asteroids will break apart into smaller fragments when colliding with the player's ship or other asteroids, while others will bounce off but remain intact.

These changes pull the player's attention away from the skybox, creating a sense that the player truly is in space, all without disrupting gameplay.
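
The spawning logic Red Storm used isn't shown in this article, but a minimal sketch of the idea might look like the following. It assumes a hypothetical asteroid prefab that already has a Rigidbody and a collider attached; the spawn distance, cap, and launch probability are placeholder values.

using UnityEngine;

public class DynamicAsteroidSpawner : MonoBehaviour
{
    public GameObject AsteroidPrefab;           // assumed prefab with a Rigidbody and collider
    public Camera PlayerCamera;
    public float SpawnDistance = 150.0f;
    public float InitialVelocityChance = 0.2f;  // small chance to launch the asteroid immediately
    public float LaunchSpeed = 20.0f;
    public int MaxDynamicAsteroids = 32;

    int SpawnedCount = 0;

    void Update()
    {
        if (SpawnedCount >= MaxDynamicAsteroids)
        {
            return;
        }

        // Pick a random point on a sphere around the ship at the spawn distance
        Vector3 candidate = transform.position + Random.onUnitSphere * SpawnDistance;

        // Only spawn outside the player's view frustum so asteroids never pop into existence on screen
        Vector3 viewportPos = PlayerCamera.WorldToViewportPoint(candidate);
        bool inFrustum = viewportPos.z > 0.0f &&
                         viewportPos.x > 0.0f && viewportPos.x < 1.0f &&
                         viewportPos.y > 0.0f && viewportPos.y < 1.0f;
        if (inFrustum)
        {
            return;
        }

        GameObject asteroid = Instantiate(AsteroidPrefab, candidate, Random.rotation);
        SpawnedCount++;

        // Occasionally launch the new asteroid toward the field to keep the scene active
        if (Random.value < InitialVelocityChance)
        {
            Rigidbody asteroidBody = asteroid.GetComponent<Rigidbody>();
            asteroidBody.velocity = (transform.position - candidate).normalized * LaunchSpeed;
        }
    }
}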

What Makes it CPU Intensive?

Having large numbers of dynamic asteroid fragments flying around asteroid fields using Rigidbody physics, instantiating un-pooled fragments while moving, and generating additional fragments when asteroids break apart all use a lot of CPU time.

Is it Built into Unity*?

The dynamic asteroids use Unity’s* Rigidbody Physics and Particles Systems, but the system to generate the asteroids was written and customized by the Star Trek™: Bridge Crew* team. Check out the sample below to see how you can implement a similar system yourself.

How it’s Done

If the player’s machine is capable, previously static models in the scene that don’t need to remain static can have Rigidbody physics enabled. This can be done dynamically in script by adding new Rigidbody components to existing objects, or by generating prefabs of preconfigured objects on-the-fly. Dynamic objects and objects that can be interacted with do a lot to increase immersion in games, especially in VR.

using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.Assertions;

public class StaticDynamicController : MonoBehaviour {

    public static StaticDynamicController Singleton = null;
    public GameObject[] PotentiallyDynamicObjects;
    int NumDynamicObjects = 0;

    void Awake()
    {
        if (!Singleton)
        {
            Debug.Log("Creating StaticDynamicController");
            Singleton = this;
        }
        else
        {
            Assert.IsNotNull(Singleton, "(Obj:" + gameObject.name + ") Only 1 instance of StaticDynamicController needed at once");
            DestroyImmediate(this);
        }
    }

    public void Init()
    {
        Debug.Log("Initializing StaticDynamicController");
    }

    void Start () {
        SetCPULevel(CPUCapabilityManager.Singleton.CPUCapabilityLevel);
    }

    public void SetCPULevel(CPUCapabilityManager.SYSTEM_LEVELS sysLevel)
    {
        if (sysLevel == CPUCapabilityManager.SYSTEM_LEVELS.HIGH)
        {
            NumDynamicObjects = PotentiallyDynamicObjects.Length;
        }
        else if (sysLevel == CPUCapabilityManager.SYSTEM_LEVELS.MEDIUM)
        {
            NumDynamicObjects = PotentiallyDynamicObjects.Length / 2;
        }
        else if (sysLevel == CPUCapabilityManager.SYSTEM_LEVELS.LOW)
        {
            NumDynamicObjects = PotentiallyDynamicObjects.Length / 3;
        }
        else if (sysLevel == CPUCapabilityManager.SYSTEM_LEVELS.OFF)
        {
            NumDynamicObjects = 0;
        }

        Debug.Log("(Obj:" + gameObject.name + ") System capability set to: " + CPUCapabilityManager.Singleton.CPUCapabilityLevel + ", so setting number of dynamic objects to: " + NumDynamicObjects);

        for (int i = 0; i < NumDynamicObjects; i++)
        {

            Rigidbody objRigidBody = PotentiallyDynamicObjects[i].AddComponent<Rigidbody>();
            objRigidBody.useGravity = true;
            objRigidBody.mass = 10;
            // The component type on the next line was lost in the original listing; a collider is assumed here
            PotentiallyDynamicObjects[i].AddComponent<SphereCollider>();
        }
    }
}

Cloud Wakes and Solar Flares

Overview

Cloud wakes increase immersion by creating the illusion that enemy ships and the Aegis are displacing dust as they move through space. Solar flares accomplish the same thing by distracting the eye from the skybox, making the player feel like they are in the far reaches of space.

What Makes it CPU Intensive?

The cloud wakes and solar flares use scripted particle behaviors which require updating the particles individually using a script on the main thread. Looping through several hundred to a few thousand particles and updating their properties through script uses a lot of CPU time, but allows custom behavior for particles that wouldn’t be possible using the normal particle system properties offered out of the box in Unity*. Keep in mind that this must currently be done on the main thread, so this system can’t go as wide on the cores as the previously mentioned particle collision system. Stay tuned for Unity’s* new C# job system mentioned at Unite Europe 2017 which will extend the Unity* API to allow better multi-threading in script code.

Is it Built into Unity*?

Cloud wakes and solar flares use Unity’s* Particle System, but how the particles move and change over time was scripted by Red Storm Entertainment. The wake effect emits particle trails from several emitter points on the ship using a single particle system. The size and lifetime of the particles in a single trail are based on its emitter. The trail particles are emitted in world space, but the emitter points stay attached to the ship so that they continue to emit from the correct locations as the ship turns and banks. The custom particle behavior script adds virtual “attractor” objects behind the ship that oscillate randomly to pull nearby particles towards them, introducing turbulence to the trails behind the ships while passing through clouds. The solar flares also use the attractor behavior to either splash the particles outward or pull them back towards the sun’s surface after initially having been emitted outward. The following simple example shows how to make all particles head towards the world origin.

using System.Collections;
using System.Collections.Generic;
using UnityEngine;

public class ParticleBehavior : MonoBehaviour {

    public ParticleSystem MyParticleSystem;
    ParticleSystem.Particle[] MyParticles = new ParticleSystem.Particle[4000];
    public float ParticleSpeed = 10.0f;

	void Update () {
        int numParticles = MyParticleSystem.GetParticles(MyParticles);

        for(int i = 0; i < numParticles; i++)
        {
            MyParticles[i].position = Vector3.MoveTowards(MyParticles[i].position, Vector3.zero, Time.deltaTime * ParticleSpeed);
        }

        MyParticleSystem.SetParticles(MyParticles, numParticles);
	}
}
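
Red Storm's attractor implementation isn't published, but the following sketch extends the example above to approximate the behavior described: invisible attractor transforms trailing the ship oscillate and pull nearby particles toward themselves. The strength, radius, and oscillation values are placeholder assumptions.

using UnityEngine;

public class ParticleAttractorBehavior : MonoBehaviour
{
    public ParticleSystem WakeParticleSystem;
    public Transform[] Attractors;              // empty GameObjects trailing the ship (assumed)
    public float AttractorStrength = 5.0f;
    public float AttractorRadius = 3.0f;
    public float OscillationAmplitude = 0.5f;
    public float OscillationFrequency = 2.0f;

    ParticleSystem.Particle[] MyParticles = new ParticleSystem.Particle[4000];
    Vector3[] BasePositions;

    void Start()
    {
        // Remember each attractor's resting position so it can oscillate around it
        BasePositions = new Vector3[Attractors.Length];
        for (int i = 0; i < Attractors.Length; i++)
        {
            BasePositions[i] = Attractors[i].localPosition;
        }
    }

    void Update()
    {
        // Oscillate the attractors to introduce turbulence into the wake
        for (int i = 0; i < Attractors.Length; i++)
        {
            float phase = OscillationFrequency * Time.time + i;
            Attractors[i].localPosition = BasePositions[i] +
                new Vector3(Mathf.Sin(phase), Mathf.Cos(phase * 1.3f), 0.0f) * OscillationAmplitude;
        }

        // Pull nearby particles toward any attractor within range
        int numParticles = WakeParticleSystem.GetParticles(MyParticles);
        for (int i = 0; i < numParticles; i++)
        {
            for (int j = 0; j < Attractors.Length; j++)
            {
                Vector3 toAttractor = Attractors[j].position - MyParticles[i].position;
                if (toAttractor.sqrMagnitude < AttractorRadius * AttractorRadius)
                {
                    MyParticles[i].velocity += toAttractor.normalized * AttractorStrength * Time.deltaTime;
                }
            }
        }
        WakeParticleSystem.SetParticles(MyParticles, numParticles);
    }
}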

Ship Destruction

Overview

The ship destruction feature enhances the game by giving players a more satisfying feeling when defeating an enemy. Traditionally in games, exploding objects are occluded with an explosion effect to mask the pop when the discarded GameObject is removed from the scene. With the CPU power available in higher-end setups, we can instead split the model into pieces, launch them all in different directions, and even add sub-destruction. Each piece can collide with scene dressing and then ultimately disappear, or linger if the system can handle it.

What Makes it CPU Intensive?

The artists break each ship into many separate parts, all of which contain Rigidbody components and are animated via physics forces when they're initialized. Collision with other objects (e.g., asteroids, ships) was enabled to ensure realistic behavior in the environment. Furthermore, each exploded ship part has particle trails attached to it.

Is it Built into Unity*?

The Rigidbody and physics aspects of this feature are entirely built in, and Unity*-specific methods are used to add explosion forces to the ship parts. Afterwards, the parts are animated and collide with objects using Unity's* Rigidbody physics system. A Unity* Particle System emits particles with sub-emitters that create trails behind the pieces, but the top-level particle positions are managed in script to ensure they remain attached to the exploded ship parts without worrying about parent coordinate spaces.
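
The trail-attachment code used in the game isn't included in this article. As a rough sketch of the idea, assuming one long-lived top-level particle is emitted per fragment and the system has a trail sub-emitter configured, the top-level particles can be pinned to their fragments each frame:

using UnityEngine;

public class FragmentTrailFollower : MonoBehaviour
{
    public ParticleSystem DebrisTrailSystem;    // assumed to have a trail sub-emitter configured
    public Transform[] Fragments;               // exploded ship pieces

    ParticleSystem.Particle[] MyParticles;

    void Start()
    {
        MyParticles = new ParticleSystem.Particle[Fragments.Length];
        DebrisTrailSystem.Emit(Fragments.Length);   // one long-lived particle per fragment
    }

    void LateUpdate()
    {
        int count = DebrisTrailSystem.GetParticles(MyParticles);
        for (int i = 0; i < count && i < Fragments.Length; i++)
        {
            // Pin each top-level particle to its fragment; the sub-emitter trail follows it
            MyParticles[i].position = Fragments[i].position;
        }
        DebrisTrailSystem.SetParticles(MyParticles, count);
    }
}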

How it’s Done

Build out your models in pieces separated by various break points. Outfit each game object containing a mesh renderer in Unity* with a Rigidbody component. When the object should be destroyed, enable the Rigidbody components on each mesh and apply an explosive force to all of them. See Unity’s* Rigidbody documentation for more details.

using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.Assertions;

public class ExplosionController : MonoBehaviour {

    // Explosion arguments
    public float ExplosiveForce;
    public float ExplosiveRadius;
    public Transform ExplosiveTransform;    // Centerpoint of explosion

    public Rigidbody BaseRigidBody;
    public GameObject[] PotentiallyDetachableCubes;
    List<Rigidbody> ObjRigidbodies = new List<Rigidbody>();
    bool IsCPUCapable = false;
    bool HasExploded = false;

	void Start ()
    {
        SetCPULevel(CPUCapabilityManager.Singleton.CPUCapabilityLevel);
    }

    public void SetCPULevel(CPUCapabilityManager.SYSTEM_LEVELS sysLevel)
    {
        // Only use if CPU deemed medium or high capability
        if (sysLevel == CPUCapabilityManager.SYSTEM_LEVELS.HIGH
            || sysLevel == CPUCapabilityManager.SYSTEM_LEVELS.MEDIUM)
        {
            IsCPUCapable = true;

            // add rigidbodies to all little cubes
            for (int i = 0; i < PotentiallyDetachableCubes.Length; i++)
            {
                Rigidbody CurrRigidbody = PotentiallyDetachableCubes[i].AddComponent<Rigidbody>();
                CurrRigidbody.isKinematic = true;
                CurrRigidbody.useGravity = false;
                ObjRigidbodies.Add(CurrRigidbody);
            }
            Debug.Log("(ExplosionController) System capability set to: " + CPUCapabilityManager.Singleton.CPUCapabilityLevel + ", so object (" + gameObject.name + ") is destructible");
        }
        else
        {

            Debug.Log("(ExplosionController) System capability set to: " + CPUCapabilityManager.Singleton.CPUCapabilityLevel + ", so object (" + gameObject.name + ") not destructible");
        }
    }

    public void ExplodeObject()
    {
        HasExploded = true;
        if (IsCPUCapable)
        {
            BaseRigidBody.useGravity = false;
            BaseRigidBody.isKinematic = true;
            BoxCollider[] BaseColliders = GetComponents<BoxCollider>();
            for(int i = 0; i < BaseColliders.Length; i++)
            {
                BaseColliders[i].enabled = false;
            }
            for (int i = 0; i < ObjRigidbodies.Count; i++)
            {
                Rigidbody CurrRigidbody = ObjRigidbodies[i];
                CurrRigidbody.isKinematic = false;
                CurrRigidbody.useGravity = true;
                CurrRigidbody.AddExplosionForce(ExplosiveForce, ExplosiveTransform.position, ExplosiveRadius);
                // The component type on the next line was lost in the original listing; a collider for the detached cube is assumed here
                ObjRigidbodies[i].gameObject.AddComponent<BoxCollider>();
            }
        }
        else
        {
            // Boring destruction implementation
            BaseRigidBody.AddExplosionForce(ExplosiveForce, ExplosiveTransform.position, ExplosiveRadius);
        }
    }

    void OnCollisionEnter(Collision collision)
    {
        if(!HasExploded)
        {
            ExplosiveTransform.position = collision.contacts[0].point;
            ExplodeObject();
        }
    }
}

CPU Capability Detection Plugin

OK, so we've been through each of the features added to Star Trek™: Bridge Crew*, but how do we determine what our target system can handle? To make this as painless as possible, we've created an easy-to-use Unity* plugin with source. It includes example code for both Unity* and native implementations and acts as a toolbox that surfaces system metrics to help you define your target system categories. Many of the above examples are integrated into the sample to make it easy to hit the ground running. Here are the steps (a condensed sketch of a manager class that ties them together follows the list):

  1. Define your CPU performance tiers.
    public enum SYSTEM_LEVELS
        {
            OFF,
            LOW,
            MEDIUM,
            HIGH,
            NUM_SYSTEMS
        };
  2. Set your CPU value thresholds. Various metrics are supplied from the plugin, such as logical/physical core count, max frequency, system memory, etc. However, you can always add your own if you’d like to consider other factors. For most basic uses, the supplied metrics should suffice.
            // i5-4590
            LowSettings.NumLogicalCores = 4;
            LowSettings.UsablePhysMemoryGB = 8;
            LowSettings.MaxBaseFrequency = 3.3;
            LowSettings.CacheSizeMB = 6;
    
            // i7 - 7820HK - Set to turbo mode
            MedSettings.NumLogicalCores = 8;
            MedSettings.UsablePhysMemoryGB = 8;
            MedSettings.MaxBaseFrequency = 3.9;
            MedSettings.CacheSizeMB = 8;
    
            // i7-6700k
            HighSettings.NumLogicalCores = 8;
            HighSettings.UsablePhysMemoryGB = 8;
            HighSettings.MaxBaseFrequency = 4.0;
            HighSettings.CacheSizeMB = 8;
    
  3. Initialize the plugin and determine if the user is running on an Intel® processor.
    void QueryCPU()
        {
            InitializeResources();
            if (IsIntelCPU())
            {
                // Your performance categorization code
            }
            else
            {
                Debug.Log("You are not running on an Intel CPU");
            }
        }
  4. Query the target system.
    StringBuilder cpuNameBuffer = new StringBuilder(BufferSize);
                GetProcessorName(cpuNameBuffer, ref BufferSize);
                SysLogicalCores = GetNumLogicalCores();
                SysUsablePhysMemoryGB = GetUsablePhysMemoryGB();
                SysMaxBaseFrequency = GetMaxBaseFrequency();
                SysCacheSizeMB = GetCacheSizeMB();
  5. Compare your threshold values to determine which previously defined performance tier the system tested belongs in.
    bool IsSystemHigherThanThreshold(SystemThreshold threshold)
        {
            if (threshold.NumLogicalCores < SysLogicalCores && threshold.MaxBaseFrequency < SysMaxBaseFrequency && threshold.UsablePhysMemoryGB < SysUsablePhysMemoryGB && threshold.CacheSizeMB < SysCacheSizeMB)
            {
                return true;
            }
            return false;
        }
    SYSTEM_LEVELS MySystemLevel = SYSTEM_LEVELS.OFF;
    
    if (IsSystemHigherThanThreshold(HighSettings) || IsWhitelistedCPU(SYSTEM_LEVELS.HIGH))
            {
                MySystemLevel = SYSTEM_LEVELS.HIGH;
            }
            else if (IsSystemHigherThanThreshold(MedSettings) || IsWhitelistedCPU(SYSTEM_LEVELS.MEDIUM))
            {
                MySystemLevel = SYSTEM_LEVELS.MEDIUM;
            }
            else if (IsSystemHigherThanThreshold(LowSettings) || IsWhitelistedCPU(SYSTEM_LEVELS.LOW))
            {
                MySystemLevel = SYSTEM_LEVELS.LOW;
            }
            else
            {
                MySystemLevel = SYSTEM_LEVELS.OFF;
            }
    
            Debug.Log("Your system level has been categorized as: " + MySystemLevel);

Performance Profiling and Considerations

Just like with GPU work, we need to verify that our feature set's combined CPU utilization doesn't exceed our low, medium, and high targets and cause constant Asynchronous Spacewarp (ASW; trying very hard to resist a terrible Star Trek™ pun) and reprojection triggers. We wanted to make sure that the game maintained a consistent 90 frames per second while still maximizing the CPU, no matter what machine the game was running on. The Star Trek™: Bridge Crew* team decided on three levels of feature sets: Off, Partial, and Full. So, we tested the Full group of features on a machine that matched our Off threshold.


GPUView showing work distribution on desktop system with HSW i5-4590 CPU + GTX 1080 GPU

CPU: HSW i5-4590 | Graphics Card: GTX 1080 | Scenario: Mission 5 after initial warp | Configuration: Full Settings

Run      | Refresh Intervals | New Frames | Dropped Frames | Synthetic Frames Generated
1        | 11861             | 5993       | 58             | 5810
2        | 11731             | 6584       | 56             | 5091
3        | 11909             | 6175       | 101            | 5633
Averages | 11833.67          | 6250.67    | 71.67          | 5511.33

A non-zero synthetic frame count indicates that the CPU work exceeded the 11.1 ms per-frame threshold with the full feature set on the lower-end CPU.

The above GPUView screenshot shows ~22 ms passing from present to present (highlighted). Present indicates when the final frame has been generated and is ready for submission to the head-mounted display (HMD). This can be thought of in terms of frame rate (~45 fps). Going from 90 to 45 fps means that we are consistently triggering ASW with this configuration running on our 'Off' tier system. Looking at three test runs over Mission 5, we see an average of ~5.5k synthetic frames being generated because of ASW triggers. Squeezing these immersive features onto the Oculus min-spec didn't work out, as we expected. But rather than keeping the feature sets off across all configurations, we bound feature sets to hardware levels that we can determine at run time and activate the appropriate set, allowing players at all hardware levels to experience the game as it should be experienced. If we look at the same configuration running on our high-end target (Intel® Core™ i7-7700K processor), things change.


GPUView showing work distribution on desktop system with KBL i7-7700K CPU + GTX 1080 GPU

CPU: KBL i7-7700K | Graphics Card: GTX 1080 | Scenario: Mission 5 after initial warp | Configuration: Full Settings

Run      | Refresh Intervals | New Frames | Dropped Frames | Synthetic Frames Generated
1        | 11703             | 11666      | 37             | 0
2        | 11654             | 11617      | 37             | 0
3        | 11700             | 11672      | 28             | 0
Averages | 11685.67          | 11651.67   | 34.00          | 0.00

A synthetic frame count of zero indicates that the CPU work never exceeded the 11.1 ms per-frame threshold with the full feature set on the higher-end CPU.

With the additional logical cores, increased frequency, and bigger cache of our high-end target, all the work can spread out and complete within the 11.1 ms required to hit 90 fps. The average duration of CPU work per frame ranges from 9 to 10.3 ms from head to tail. This means we are pushing our high-end target nearly to its limit, but still maintaining a solid 90 fps and utilizing all of the resources available to us. We've hit the sweet spot! OK, so we've got our 'Off' and 'Full' feature sets tested. At this point, we needed to select a subset of the 'Full' features to enable on Intel® Core™ i7-7820HK processor-based VR-ready notebooks; this is our mid-tier target for the 'Partial' feature set. We wanted to keep features that really affected the inside of the bridge, so we prioritized those and slowly removed the others one by one until we hit the sweet spot. Eventually, we only had to cut the dynamic wake effects and dynamic asteroids to comfortably push out 90 fps on the laptop. Here is a screen capture of GPUView* showing the 'Partial' feature set running on our test VR-ready notebook.


GPUView showing work distribution on VR Gaming Laptop with KBL i7-7820HK CPU + GTX 1080 GPU

CPU: KBL i7-7820HK | Graphics Card: GTX 1080 | Scenario: Mission 5 after initial warp | Configuration: Full Settings

Run      | Refresh Intervals | New Frames | Dropped Frames | Synthetic Frames Generated
1        | 11887             | 11242      | 116            | 529
2        | 11881             | 11315      | 110            | 456
3        | 11792             | 10912      | 125            | 755
Averages | 11853.33          | 11156.33   | 117.00         | 580.00

A non-zero synthetic frame count indicates that the CPU work exceeded the 11.1 ms per-frame threshold with the full feature set on the VR-ready laptop.

CPU: KBL i7-7820HK | Graphics Card: GTX 1080 | Scenario: Mission 5 after initial warp | Configuration: Partial Settings

Run      | Refresh Intervals | New Frames | Dropped Frames | Synthetic Frames Generated
1        | 11882             | 11844      | 38             | 0
2        | 10171             | 10146      | 25             | 0
3        | 11971             | 11933      | 38             | 0
Averages | 11341.33          | 11307.67   | 33.67          | 0.00

A synthetic frame count of zero indicates that the CPU work never exceeded the 11.1 ms per-frame threshold with the partial feature set on the VR-ready laptop.

Conclusion

Overall, CPU usage increases the most from the use of more realistic, higher-resolution simulations and from the existence of more dynamic entities; physics simulations previously thought to be too expensive can now be enabled on many CPUs. Additionally, various other CPU-intensive systems like animation/inverse kinematics (IK), cloth simulation, flocking, fluid simulation, procedural generation, and more can be used to create a richer and more realistic world. The industry has had settings tiers for graphics for a while now, and it's time we start thinking the same way about CPU settings. When developing a game, think about all the untapped compute potential on different hardware levels and consider how it can be harnessed to make your game something special. Check the links below for more information. Happy hunting.

  • Special thanks to Kevin Coyle and the rest of the Red Storm Entertainment team, who worked with us on this partnership and helped put together this article.

Additional Resources

“Set Graphics to Stun: Enhancing VR Immersion with the CPU in Star Trek™: Bridge Crew*”

The author presented the information in this article at Unite 2016.

Session Description: Many games and experiences these days put a huge emphasis on GPU work and let the many cores built into modern mainstream CPUs sit idle on the sidelines. This talk explores how Ubisoft's Red Storm studio and Intel partnered to push immersion as far as possible in Star Trek™: Bridge Crew, using Unity* to take advantage of these available resources. Learn how you can achieve stunning visuals with minimal performance impact on the GPU in your own games!

Catlike Coding

Catlike Coding offers a number of great CPU/math-heavy tutorials that anybody can pick up and run with. The tutorials are Unity*-focused but can apply to any other engine as the meat of the content doesn’t depend on any particular API. It’s highly recommended for those interested in procedural generation, leveraging curves/splines, mesh deformation, texture manipulation, noise, and more.

Fluid Simulation for Video Games (Series)

This is a well-written tutorial on implementing fluid simulation for video games in a way that leverages many cores. The article is great for beginners, walking them through everything from concept to implementation. By the end of the article, the reader will have source code to add to their own engine and an understanding of how to manipulate the code to emulate various fluid types.

Link: https://software.intel.com/en-us/articles/fluid-simulation-for-video-games-part-1


VR Content Developer Guide


Get general guidelines for developing and designing your virtual reality (VR) application, and learn how to obtain optimal performance. This guide is based on performance characterizations across several VR workloads, and defines common bottlenecks and issues. Find solutions that address bandwidth-bound workloads through the choice of texture formats, fusing shader passes, and post-anti-aliasing techniques to improve the performance of VR application workloads.

Goals

  • Define general design point and budget recommendations for developers on creating VR content using 7th generation Intel® Core™ i7 processors and Intel® HD Graphics 615 (GT2).
  • Provide guidance and considerations for obtaining optimal graphics performance on 7th generation Intel® Core™ i7 processors.
  • Provide suggestions for optimal media, particularly 3D media.
  • Provide tips on how to design VR apps for sustained power, especially for mobile devices.
  • Identify tools that help developers identify VR issues in computer graphics on VR-ready hardware.

Developer Recommended Design Points

General guidelines on design points and budgets to ISVs

  • Triangles/Frame - 200 K - 300 K visible triangles in a given frame.* Use aggressive culling of view frustum, back face, and occlusion to reduce the number of actual triangles sent to the GPU.
  • Draws/Frame - 500 - 1000*. Reduce number of draw calls to improve performance and power. Batch draws by shader and draw front-to-back with 3D workloads (refer to 3D guide).
  • Target Refresh - At least 60 frames per second (fps), 90 fps for best experience.
  • Resolution - Resolution of head-mounted display (HMD) can downscale if needed to hit 60 fps but cannot go below 80 percent of HMD resolution.* Dynamic scaling of render target resolution can also be considered to meet frame rate requirements.*
  • Memory - 180 MB ‒ 200 MB per frame (DDR3, 1600 MHz) for 90 fps.*

*Data is an initial recommendation and may change.

Considerations for Optimal Performance on General Hardware

Texture Formats and Filtering Modes

  • Texture formats and filtering modes can have a significant impact on bandwidth.
  • Generally 32-bit and 64-bit image formats are recommended for most filtering modes (bilinear and so on).
  • Filtering trilinear and volumetric surfaces with standard red green blue and high-dynamic range (sRGB/HDR) formats will be slower compared to 32-bit cases.

Uncompressed Texture Formats

Uncompressed formats (sRGB and HDR) consume greater bandwidth. Use linear formats if the app becomes heavily bandwidth-bound.

HDR Formats

The use of R10G10B10A2 over R16G16B16A16 and floating point formats is encouraged. 

Filtering Modes

Filtering modes, like anisotropic filtering, can have a significant impact on performance, especially with uncompressed formats and HDR formats.

Anisotropic filtering is a trade-off between performance and quality. Generally, anisotropic level two is recommended based on our performance and quality studies. Mip-mapping textures along with anisotropic levels adds overhead to the filtering and hardware pipeline. If you choose anisotropic filtering, we recommend using BC1‒BC5 formats.

Anti-Aliasing

Temporally stable anti-aliasing is crucial for a good VR experience. Multisample anti-aliasing (MSAA) is bandwidth intensive and consumes a significant portion of the rendering budget. Temporally stable post-process anti-aliasing algorithms, like TSCMAA, can provide equivalent functionality at half the cost and should be considered as alternatives.

Low-Latency Preemption

Gen hardware supports object-level preemption, which usually translates into preemption on triangle boundaries. For effective scheduling of the compositor, it is important that the primitives can be preempted in a timely fashion. To enable this, draw calls that take more than 1 ms should usually have more than 64‒128 triangles. Typically, full-screen post-effects should use a grid of at least 64 triangles as opposed to 1 or 2.
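
The guideline above is engine-agnostic. Purely as an illustration, here is a small Unity* C# sketch (the engine choice is an assumption for demonstration only) that builds a full-screen grid of N×N quads, giving 2·N·N triangles; N = 6 yields 72 triangles, which satisfies the 64-triangle guideline for full-screen post-effect draws.

using UnityEngine;

public static class FullScreenGridMesh
{
    public static Mesh Create(int n = 6)
    {
        var mesh = new Mesh();
        int vertsPerSide = n + 1;
        var vertices = new Vector3[vertsPerSide * vertsPerSide];
        var uvs = new Vector2[vertices.Length];
        var triangles = new int[n * n * 6];

        // Vertices span [-1, 1] in x/y for full-screen coverage
        for (int y = 0; y < vertsPerSide; y++)
        {
            for (int x = 0; x < vertsPerSide; x++)
            {
                float u = (float)x / n;
                float v = (float)y / n;
                int i = y * vertsPerSide + x;
                vertices[i] = new Vector3(u * 2.0f - 1.0f, v * 2.0f - 1.0f, 0.0f);
                uvs[i] = new Vector2(u, v);
            }
        }

        // Two triangles per grid cell
        int t = 0;
        for (int y = 0; y < n; y++)
        {
            for (int x = 0; x < n; x++)
            {
                int i0 = y * vertsPerSide + x;
                int i1 = i0 + 1;
                int i2 = i0 + vertsPerSide;
                int i3 = i2 + 1;
                triangles[t++] = i0; triangles[t++] = i2; triangles[t++] = i1;
                triangles[t++] = i1; triangles[t++] = i2; triangles[t++] = i3;
            }
        }

        mesh.vertices = vertices;
        mesh.uv = uvs;
        mesh.triangles = triangles;
        return mesh;
    }
}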

App Scheduling

1. Recommendation: Nothing additional is required.

screenshot of frame rendering values

In the ideal case for a given frame, the app has ample time to complete its rendering work between the vsync and the submission of the late-stage reprojection (LSR) packet. In this case, it is best that the app synchronize on the vsync itself, so that rendering is performed on the newest HMD positional data. This helps to minimize motion sickness.

2. Recommendation: Start work earlier by syncing on the compositor starting work rather than vsync.

screenshot of frame rendering values

When the frame rendering time no longer fits within this interval, all available GPU time should be reclaimed for rendering the frame before the LSR occurs. If this interval is not met, the compositor can block the app from rendering the next frame by withholding the next available render target in the swap chain. This results in entire frames being skipped until the present workload for that frame has finished, causing a degradation of fps for the app. The app should synchronize with its compositor so that new rendering work is submitted as soon as the present or LSR workload is submitted. This is typically accomplished via a wait behavior provided by the compositor API.

vector image

3. Recommendation: Present asynchronously.

In the worst case, when the frame rendering time exceeds the vsync, the app should submit rendering work as quickly as possible to fully saturate the GPU to allow the compositor to use the newest frame data available, whenever that might occur relative to the LSR. To accomplish this, do not wait on any vsync or compositor events to proceed with rendering, and if possible build your application so that the presentation and rendering threads are decoupled from the rest of the state update.

For example, on the Holographic API, pass DoNotWaitForFrameToFinish to PresentUsingCurrentPrediction, or in DirectX*, pass SyncInterval=0 to Present.

4. Recommendation: Verify with GPU analysis tools.

Use GPU analysis tools, such as GPUView, to see which rendering performance profile you have encountered, and make the necessary adjustments detailed above.

Other design considerations

Half float versus float: For compute-bound workloads, half floats can be used to increase throughput where precision is not an issue. Mixing half and full precision results in performance penalties and should be minimized.

Tools

The following tools will help you identify issues with VR workloads.

GPUView: Gives specifics on identifying issues with scheduling and dropped frames.

Intel® Graphics Performance Analyzers: Gives specifics on analyzing VR workloads and the expected patterns we see. For example, two sets of identical calls for the left and right eyes.

Additional Resources

Summary

The biggest challenge to performance for VR workloads comes from being bandwidth-bound. The texture format, fusing shader passes, and using post anti-aliasing techniques help reduce the pressure on bandwidth.

Contributors

The developer guidelines provided in this document are created with input from Katen Shah, Bill Sadler, Prasoon Surti, Mike Apodaca, Rama Harihara, Travis Schuler, and John Gierach.

How to tell CPU model, when running under hypervisor that spoofs CPUID


When running in a virtual machine, you may never be sure which physical CPU you are running on; the hypervisor can pass anything as CPUID.
For best performance, it helps to use the best instruction set supported by the physical CPU, be it AVX-512, AVX2, AVX, SSE4.1, AES-NI, or another accelerated instruction set. Enhanced Platform Awareness features use a top-down approach to close this gap, but a bottom-up approach is also possible.


Complexity Sciences Center, University of California, Davis


Principal Investigators:

Jim Crutchfield teaches nonlinear physics at the University of California, Davis, directs its Complexity Sciences Center, and promotes science interventions in nonscientific settings. He's mostly concerned with what patterns are, how they are created, and how intelligent agents discover them; see http://csc.ucdavis.edu/~chaos/

 

Description:

A novel approach to data-driven prediction and unsupervised learning of coherent structures in climate dynamics. We will extend our unsupervised machine learning methods in two fundamental ways. The first is that our methods will facilitate pattern discovery: inferring both known patterns and novel, as-yet-unseen patterns and coherent structures from the data. We let the data tell us the appropriate representations to use to describe patterns, as opposed to selecting a single favorite functional basis or comparing candidates by trial and error. The second is to adapt our methods to spatiotemporal data, in which spatial configurations (e.g., velocity vector fields) evolve over time. The goal is to implement structural inference in a principled way that naturally includes temporal dynamics. A wholly new approach such as this, which facilitates the discovery of emergent dynamical patterns in spatiotemporal data, is ideally matched to the fundamental algorithmic challenges posed in climate modeling.

 

Related websites:

http://csc.ucdavis.edu/

http://csc.ucdavis.edu/~chaos/

http://informationengines.org/

Intel® Parallel Computing Center at University of Oxford


Principal Investigators:

Dr. Wood is an associate professor in the Department of Engineering Science at the University of Oxford, a Fellow of the Alan Turing Institute, a Governing Body Fellow of Kellogg College, and a founder of Invrea, Ltd and infinitemonkeys.ai. Formerly Dr. Wood was an assistant professor of Statistics at Columbia University and a postdoctoral fellow of the Gatsby Computational Neuroscience Unit of the University College London.  He received his PhD from Brown University in computer science and his BS from Cornell University.  Dr. Wood has raised over $5M from DARPA, BP, Google, Intel, and Microsoft.  Prior to his academic career Dr. Wood was a successful entrepreneur. 

Description:

In collaboration with NYU, LBNL, and Intel we propose to use probabilistic programming techniques to filter and explain high energy physics experiments by automatically inferring the structure of events, including the particles produced and their properties, directly from observed experimental results. The physics community already has a series of simulation software tools both for the underlying physics and for the modeling of the interaction of the underlying particles with experimental detector. Bringing these together using probabilistic programming powered by massively parallel high performance computing will enable us to tackle the fundamental inference problem in particle physics directly for the first time, offering a new way for particle physicists to tackle the detection of novel physics signatures, ultimately at Large Hadron Collider data and computation scale.

Related websites:

http://www.robots.ox.ac.uk/~fwood


Intel® Parallel Computing Center at The Molecular Sciences Software Institute


Principal Investigators:

Prof. T. Daniel Crawford's research expertise includes the development of high-accuracy quantum chemical models for the spectroscopic properties of chiral molecules in both gas and liquid phases. For more than 20 years he has been a lead developer of the PSI quantum chemistry package, which was one of the first electronic structure packages to be distributed under a fully open-source license and is used by thousands of molecular scientists worldwide. He is the 2010 winner of the Dirac Medal of the World Association of Theoretical and Computational Chemists (WATOC).

Description:

The Molecular Sciences Software Institute (MolSSI) is a new initiative funded by the U.S. National Science Foundation to serve as a nexus for science, education, and cooperation for the community of computational molecular scientists, a broad field that includes biomolecular simulation, quantum chemistry, and materials science. The MolSSI will provide software-engineering expertise, education, and leadership to enable molecular scientists to tackle problems that are orders of magnitude larger and more complex than those currently within our grasp. The MolSSI is a joint effort by Virginia Tech, Rice University, Stony Brook University, U.C. Berkeley, Rutgers University, the University of Southern California, Stanford University, and Iowa State University.

Related Websites:

molssi.org

Intel® Parallel Computing Center at University of California, Berkeley


Principal Investigators:

Jeffrey Regier is a postdoctoral researcher at UC Berkeley in the Department of Electrical Engineering and Computer Science. His research focuses on Bayesian modeling, variational inference, and optimization for large-scale scientific applications. Jeff holds a PhD in statistics from UC Berkeley, as well as MS degrees in mathematics (UC Berkeley) and computer science (Columbia University).

Description:

Astronomical surveys are the primary source of information about the Universe beyond our solar system. They are essential for addressing key open questions in astronomy and cosmology about topics such as the life-cycles of stars and galaxies, the nature of dark energy, and the origin and evolution of the Universe.

We are developing new methods for constructing catalogs of light sources, such as stars and galaxies, for astronomical imaging surveys. These catalogs are generated by identifying light sources in survey images and characterizing each according to physical parameters such as brightness, color, and morphology. Astronomical catalogs are the starting point for many scientific analyses, such as theoretical modeling of individual light sources, modeling groups of similar light sources, or modeling the spatial distribution of galaxies. Catalogs also inform the design and operation of follow-on surveys using more advanced or specialized instrumentation (e.g., spectrographs). For many downstream analyses, accurately quantifying the uncertainty of parameters' point estimates is as important as the accuracy of the point estimates themselves.

Our approach is based on Bayesian inference, a highly accurate method that is notoriously demanding computationally. We use supercomputers containing the latest Intel hardware to quickly solve our Bayesian inference problems.

Related websites:

https://github.com/jeff-regier/Celeste.jl

https://people.eecs.berkeley.edu/~jregier/

The inside scoop on how we accelerated NumPy Umath functions


NumPy UMath Optimizations

One of the great benefits found in our Intel® Distribution for Python is the performance boost gained from leveraging SIMD and multithreading in (select) NumPy UMath arithmetic and transcendental operations, on a range of Intel CPUs, from Intel® Core™ to Intel® Xeon® and Intel® Xeon Phi™. With stock Python as our baseline, we demonstrate the scalability of the Intel® Distribution for Python using functions that are intensively used in financial math applications and machine learning:

One can see that stock Python (pip-installed NumPy from PyPI) on Intel® Core™ i5 performs basic operations such as addition, subtraction, and multiplication just as well as Intel Python, but not on Intel® Xeon™ and Intel® Xeon Phi™, where Intel Python adds at least another 10x speedup. This can be explained by the fact that basic arithmetic operations in stock NumPy are hard-coded AVX intrinsics (and thus already leverage SIMD, but do not scale to other ISA, e.g. AVX-512). These operations in stock Python also do not leverage multiple cores (i.e. no multi-threading of loops under the hood of NumPy exist with such operations). Intel Python’s implementation allows for this scalability by utilizing the following: respective Intel® MKL VML primitives, which are CPU-dispatched (to leverage appropriate ISA) and multi-threaded (leverage multiple cores) under the hood, and Intel® SVML intrinsics, a compiler-provided short vector math library that vectorizes math functions for both IA-32 and Intel® 64-bit architectures on supported operating systems. Depending on the problem size, NumPy will choose one of the two approaches. On much smaller array sizes, Intel® SVML outperforms VML due to VML’s inherent cost of setting up the environment to multi-thread loops. For any other problem size, VML outperforms SVML and this is thanks to VML’s ability to both vectorize math functions and multi-thread loops.

Specifically, on Intel® Core™ i5, Intel Python delivers greater performance on transcendentals (log, exp, erf, etc.) due to utilizing both SIMD and multi-threading. We do not see any visible benefit of multi-threading basic operations (as shown on the graph) unless NumPy arrays are very large (not shown on the graph). On Intel® Xeon®, the 10x-1000x boost is explained by leveraging both (a) AVX2 instructions in transcendentals and (b) multiple cores (32 in our setup). The even greater scalability of Intel® Xeon Phi™ relative to Intel® Xeon® is explained by the larger number of cores (64 in our setup) and a wider SIMD.

The following charts provide another view of Intel Python performance versus stock Python on arithmetic and transcendental vector operations in NumPy by measuring how close UMath performance is to respective native MKL call:

Again, on Intel® Core™ i5, the stock Python performs well on basic operations (due to hard-coded AVX intrinsics and because multi-threading from Intel Python does not add much on basic operations) but does not scale on transcendentals (loops with transcendentals are not vectorized in stock Python). Intel Python delivers performance close to native speeds (90% of MKL) on relatively big problem sizes. While running our umath optimization benchmarks on different architectures, it was discovered that the performance of the umath functions did not scale as one would expect on the Intel® Xeon Phi™. We identified an issue with Intel® OpenMP that made the MKL VML function calls perform poorly in multiprocessing mode. Our team is working closely with the Intel® MKL and Intel® OpenMP teams to resolve this issue.

To demonstrate the benefits of vectorization and multi-threading in a real-world application, we chose to use the Black Scholes model, used to estimate the price of financial derivatives, specifically European vanilla stock options. A Python implementation of the Black Scholes formula gives an idea of how NumPy UMath optimizations can be noticed at the application level:

One can see that on Intel® Core™ i5, the Black-Scholes formula scales nicely with Intel Python on small problem sizes but does not perform well on bigger problem sizes, which is explained by small cache sizes. Stock Python does marginally scale due to leveraging AVX instructions on basic arithmetic operations, but it is a whole different story on Intel® Xeon® and Intel® Xeon Phi™. With Intel Python running the same Python code on server processors, much greater scalability on much greater problem sizes is delivered. Intel® Xeon Phi™ scales better due to its bigger number of cores and, as expected, stock Python does not scale on server processors due to the lack of AVX2/AVX-512 support for transcendentals and no multi-threading utilization.

Inference of Caffe* and TensorFlow* Trained Models with Intel’s Deep Learning Deployment Toolkit


Install Deployment Toolkit

First, download Deployment Toolkit.

Then, install the Deployment Toolkit.

The application is installed by default in /opt/intel/deeplearning_XXX/ (for the root user) or in ~/intel/deeplearning_XXX/ (for a non-root user). Further in the tutorial we will refer to it as IE_INSTALL.

E.g. ~/intel/deeplearning_deploymenttoolkit_2017.1.0.4463/.

Read introduction to the tool.

Configure Model Optimizer (MO) for Caffe

Configure the Model Optimizer (MO) for Caffe.

Comments for MO configuration steps:

  1. The ModelOptimizer folder is
     IE_INSTALL/deployment_tools/model_optimizer
  2. When copying files into Caffe, you can use the following commands, assuming you are in
     IE_INSTALL/deployment_tools/model_optimizer:
    cp -R adapters/include/caffe/* CAFFE_HOME/include/caffe/
    cp -R adapters/src/caffe/* CAFFE_HOME/src/caffe
  3. To make sure that the Caffe rebuild picked up the new adapters, look for lines like the following in the console output:
    CXX src/caffe/LayerDescriptor.cpp
    CXX src/caffe/CaffeCustomLayer.cpp
    CXX src/caffe/CaffeAPI.cpp
    CXX src/caffe/NetworkDescriptor.cpp
  4. If you get a problem with the hdf5 library (at least on Ubuntu 16.04), use the following fix. By default the built library is in
    CAFFE_HOME/build/lib

You need to set a FRAMEWORK_HOME variable pointing to it. For example, add the following to ~/.bashrc:

export FRAMEWORK_HOME="CAFFE_HOME/build/lib"
export LD_LIBRARY_PATH="CAFFE_HOME/build/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="IE_INSTALL/deployment_tools/model_optimizer/bin:$LD_LIBRARY_PATH"

After you have configured MO on the machine, you need to convert an existing Caffe model to the Intermediate Representation (IR).

Configure Model Optimizer (MO) for TensorFlow (TF)

Configure the Model Optimizer (MO) for TensorFlow.

Comments for MO configuration steps:

Install TensorFlow from sources. Further in this tutorial we will refer to the framework as TF.

Follow this link for the full instructions.

Further in this tutorial, we will refer to the directory containing the TF sources (by default it is tensorflow) as TF_HOME.

Comments for TensorFlow installation steps:

  1. After cloning and going into the tensorflow (by default) directory, check out the r1.2 branch:
    git checkout r1.2
  2. If you built TF with Python 3.5, it might be useful to install the TF wheel with the pip that matches Python 3, e.g.:
    sudo pip3 install /tmp/tensorflow_pkg/tensorflow-1.2.1-cp35-cp35m-linux_x86_64.whl
  3. If you decide to check the TF installation, leave the tensorflow directory first. Otherwise, you will most likely get an error when importing TF from the Python console.

Build the Graph Transform Tool; the instructions are here.

Comments for Graph Transform Tool installation steps:

bazel build tensorflow/tools/graph_transforms:transform_graph

After that, go to the ModelOptimizerForTensorFlow directory in the MO folder, e.g.:

cd IE_INSTALL/deployment_tools/model_optimizer/ModelOptimizerForTensorFlow

Run the configuration script with the necessary parameters. For instance, it should look like:

python configure.py --framework_path TF_HOME

Note: Make sure that there are two dashes before the argument. Sometimes the page is rendered with a single dash, and you will then get an error in the terminal.

Install the modeloptimizer package:

sudo python3 setup.py install

After you have configured MO on the machine, you need to convert an existing TensorFlow model to the IR.

Convert Caffe model to IR

In this tutorial we take an already trained model, specifically AlexNet from the Caffe Model Zoo.

We will refer to the directory with the Caffe model as MODEL_DIR. It should contain a .prototxt file (the deploy-ready topology structure) and a .caffemodel file (the trained weights).

To run MO, use the tutorial.

Comments for MO running steps:

  1. To just generate the IR, you only need to execute it with these options: the precision (-p), the weights (-w), the deploy prototxt file (-d), the flag marking it as input for IE (-i), and the batch size (-b).
    ./ModelOptimizer -p FP32 \
    -w MODEL_DIR/bvlc_alexnet.caffemodel \
    -d MODEL_DIR/deploy.prototxt -i -b 1

It should output something like "Writing binary data to: Artifacts/AlexNet/AlexNet.bin".

This means that the AlexNet IR was generated and that both files (.xml - the topology description, and .bin - the model weights) were placed in the Artifacts folder.

After you have generated the IR, run it via the appropriate sample. In this case, since AlexNet is a classification model, refer to the instructions for building the IE samples and running the classification sample.

Convert TensorFlow model to IR

In this tutorial we take an already trained model, specifically the Inception V1 model from the TF Model Zoo.

We will refer to the directory with the TF model as MODEL_DIR. It should contain a .pb file - a binary protobuf with the deploy-ready topology structure and the trained weights.

How to get a model from the Model Zoo and convert it to IR

If you already have a model in .pb format, skip this section and go directly to IR generation.

First, get the model checkpoint from the given link and unzip it to a folder. We will refer to that folder as CKPT_DIR.

A .pb file is required to generate a valid IR, so you first need to create it by freezing the checkpoint.

To do that, go to the specific folder:

cd IE_INSTALL/deployment_tools/model_optimizer/ModelOptimizerForTensorFlow/modeloptimizer/frameworks/tensorflow

After that you need to run freeze_checkpoint.py:

python3 freeze_checkpoint.py --checkpoint CKPT_DIR/inception_v1.ckpt --output_pb MODEL_DIR/inception.pb --output InceptionV1/Logits/Predictions/Reshape_1 --net inception_v1

If there is no error in the terminal, you will find the .pb file in MODEL_DIR.

Now move back to the root folder:

cd IE_INSTALL/deployment_tools/model_optimizer/ModelOptimizerForTensorFlow

To run MO, use the tutorial.

Comments for MO running steps:

  1. When specifying the path to the protobuf file, use absolute paths only. Relative ones like ~/models/inception will not work.

  2. To generate the IR, you first need to know the names of the input and output layers of the model you are going to convert. If you do not know them, you can use the summarize_graph utility from TF; otherwise, skip this step. To use it:

    2.1 Go to the TF directory, e.g.:
    cd TF_HOME

    2.2 Build and run it, following the instructions.

    In the output, you will get something like:
    Found 1 possible inputs: (name=Placeholder, type=float(1), shape=[1,224,224,3])
    No variables spotted.
    Found 1 possible outputs: (name=InceptionV1/Logits/Predictions/Reshape_1, op=Reshape)
    Found 6633279 (6.63M) const parameters, 0 (0) variable parameters, and 0 control_edges
    Op types used: 298 Const, 231 Identity, 114 Add, 114 Mul, 58 Conv2D, 57 Relu, 57 Rsqrt, 57 Sub, 13 MaxPool, 9 ConcatV2, 2 Reshape, 1 AvgPool, 1 BiasAdd, 1 Placeholder, 1 Softmax, 1 Squeeze

    In this case the input layer name is Placeholder and the output layer is InceptionV1/Logits/Predictions/Reshape_1.

  3. To generate the IR, execute MO with the necessary options: the path to the protobuf model (--input_model), the output model path (--output_model), the input layer name (--input), the output layer name (--output), and the graph transforms (--transforms).

    E.g.:
    python3 modeloptimizer/scripts/model_optimizer.py \
    --input_model=MODEL_DIR/inception.pb \
    --output_model=IE_INSTALL/deployment_tools/model_optimizer/ModelOptimizerForTensorFlow/gen/v1.pb \
    --input=Placeholder \
    --output=InceptionV1/Logits/Predictions/Reshape_1 \
    --transforms="\
    strip_unused_nodes(type=float;shape=1,224,224,3) \
    remove_nodes(op=Identity) \
    remove_nodes(op=CheckNumerics) \
    fold_constants(ignore_errors=true) \
    fold_batch_norms \
    strip_unused_nodes(type=float; shape=1,224,224,3) \
    remove_nodes(op=Identity) \
    remove_nodes(op=CheckNumerics) \
    fold_constants(ignore_errors=true) \
    fold_batch_norms \
    strip_unused_nodes(type=float; shape=1,224,224,3) \
    remove_nodes(op=Identity) \
    remove_nodes(op=CheckNumerics) \
    fold_constants(ignore_errors=true) \
    fold_batch_norms  \
    calc_shapes(input_types=float; input_shapes=1,224,224,3) \
    create_ir(model_name=v1; output_dir=gen)"

    It is very important to set the names of the input and output layers in the --input and --output arguments, respectively.

    Note that this execution modifies the original model, so make sure you have made a copy of it in advance. If IR generation fails, it is better to rerun it with the clean copy of the model.

If you don't see any error message, the Inception V1 IR was generated and both files (.xml - the topology description, and .bin - the model weights) are placed in the folder with the output model that you specified in the --output_model option when running MO (step 3). In this tutorial that is the gen folder under ModelOptimizerForTensorFlow.

After you have generated the IR, run it via the appropriate sample. In this case, since Inception V1 is a classification model, refer to the instructions for building the IE samples and then running the classification sample.

Build IE samples

General info about IE is here.

More specific information about the samples and their descriptions can be found here.

Follow these instructions to build the samples.

Comments for IE samples building steps:

  1. From the build folder in the samples directory:
    cmake -DCMAKE_BUILD_TYPE=Release ..
    make
  2. After that, go to the application binaries:
    cd IE_INSTALL/deployment_tools/inference_engine/bin/intel64/Release

There you will find the binaries for all IE samples.

IE can infer an IR on a specific device. Currently, it supports CPU and GPU.

To make IE find the required libraries, use the setvars.sh script, which sets all necessary environment variables.

All that is left is to run the required sample with the appropriate command, providing the IR information.

Run IE classification sample

If your model is a classification model, it should be inferred using the IE classification sample. To run it, you can refer to the following instructions page.

We will refer to the directory which contains IR (.xml and .bin files) as IR_DIR.

For that, run (assuming you are in the Release folder):

source ../../setvars.sh

For the IR you have generated with MO, the command to run inference can be the following:

./classification_sample \
-i ~/Downloads/cat3.bmp \
-m IR_DIR/AlexNet.xml \
-d CPU

Please note that IE assumes the weights (.bin file) are in the same folder as the .xml file.

You will see the top-10 predictions for the given image.

What's new? Intel® SDK for OpenCL™ Applications 2017, R1


The following features were added in the 2017 R1 release.

  • Microsoft Visual Studio* 2017 Support
  • Eclipse* Oxygen (4.7) and Neon (4.6) IDEs Support
  • New Operating Systems Support:
    • Microsoft Windows* 10 Creators Update support, including full compatibility with the latest Intel® Graphics driver (15.46)
    • Ubuntu* 16.04 support, including full compatibility with the latest OpenCL™ 2.0 CPU/GPU driver package for Linux* OS (SRB5)
    • CentOS* 7.3 support
  • Enhanced tools support for 6th and 7th Generation Intel® Core™ Processors on Microsoft Windows* and Linux* operating systems
    • Usability enhancements and bug fixes
  • Improved OpenCL™ 2.1 and SPIR-V* support on Linux* OS
    • OpenCL 2.1 development environment with the experimental CPU-only runtime for OpenCL 2.1
    • SPIR-V generation support with Intel® Code Builder for OpenCL™ offline compiler and Kernel Development Framework including textual representation of SPIR-V binaries
  • New features in Kernel Development Framework
    • Workflow support allowing build, execution and analysis of applications with multiple kernels
    • Build from binary to reduce compilation time for complex kernels
    • Latency analysis on 6th and 7th Generation Intel® Core™ Processors

Intel® SDK for OpenCL™ Applications 2017 includes all the features for OpenCL development on Windows* previously available in Intel® SDK for OpenCL™ Applications 2016 R3, and all the features for Linux* development that were available in Code Builder for Intel® Media Server Studio.
