
Performance Optimization of memcpy in DPDK


Introduction

Memory copy (memcpy) is a simple yet diverse operation: there are possibly hundreds of implementations that copy data from one part of memory to another, yet the discussion of how to evaluate and optimize memcpy never stops.

This article discusses how optimizations are positioned, conducted, and evaluated for use with memcpy in the Data Plane Development Kit (DPDK).

First, let’s look at the following simple memcpy function:

#include <stddef.h>
#include <stdint.h>

/* Copy n bytes from src to dst, one byte per iteration. */
void * simple_memcpy(void *dst, const void *src, size_t n)
{
        const uint8_t *_src = src;
        uint8_t *_dst = dst;
        size_t i;

        for (i = 0; i < n; ++i)
                _dst[i] = _src[i];

        return dst;
}

Is there anything wrong with this function? Not really. But it clearly misses some optimization opportunities. The function:

  • Does not employ single instruction, multiple data (SIMD)
  • Has no instruction-level parallelism
  • Lacks load/store address alignment

The performance of the above implementation depends entirely on the compiler’s optimization. Surprisingly, in some scenarios, this function outperforms the glibc memcpy. Of course, the compiler takes most of the credit by optimizing the implementation. But it also gets us thinking: Is there an ultimate memcpy implementation that outperforms all others?

This article holds the view that the ultimate memcpy implementation, providing the best performance in any given scenario (hardware + software + data) simply does not exist. Ironically, the best memcpy implementation is to completely avoid memcpy operations; the second-best implementation might be to handcraft dedicated code for each and every memcpy call, and there are others. Memcpy should not be considered and measured as one standalone part of the program; instead, the program should be seen as a whole—the data that one memcpy accesses has been and will be accessed by other parts of the program, also the instructions from memcpy and other parts of the program interact inside the CPU pipeline in an out-of-order manner. This is why DPDK introduced rte_memcpy, to accelerate the critical memcpy paths in core DPDK scenarios.

Common Optimization Methods for memcpy

There are abundant materials online for memcpy optimization; we provide only a brief summary of optimization methods here.

Generally speaking, memcpy spends CPU cycles on:

  1. Data load/store
  2. Additional calculation tasks (such as address alignment processing)
  3. Branch prediction

Common optimization directions for memcpy:

  1. Maximize memory/cache bandwidth (vector instructions, instruction-level parallelism)
  2. Load/store address alignment
  3. Batched sequential access
  4. Use non-temporal access instructions as appropriate
  5. Use string instructions as appropriate to speed up larger copies

Most importantly, all instructions are executed through the CPU pipeline; therefore, pipeline efficiency is everything, and the instruction flow needs to be optimized to avoid pipeline stalls.
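Below is a minimal sketch of a vectorized copy loop using SSE2 intrinsics, illustrating the first two directions (vector instructions plus simple tail handling). It is not DPDK's rte_memcpy: the function name is made up for illustration, and a production implementation would add wider vectors, destination alignment handling, and special cases for small or constant sizes.

#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Illustrative only: copy n bytes using 16-byte vector loads/stores,
 * then finish the remaining tail one byte at a time. */
static void * vector_memcpy(void *dst, const void *src, size_t n)
{
        uint8_t *d = dst;
        const uint8_t *s = src;

        while (n >= 16) {
                __m128i chunk = _mm_loadu_si128((const __m128i *)s);
                _mm_storeu_si128((__m128i *)d, chunk);
                s += 16;
                d += 16;
                n -= 16;
        }

        while (n--)
                *d++ = *s++;

        return dst;
}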

Optimizing Memcpy for DPDK

Since early 2015, DPDK's dedicated memcpy implementation, rte_memcpy, has been optimized several times to accelerate different DPDK use-case scenarios, such as vhost Rx/Tx. All the analysis and code changes can be viewed in the DPDK git log.

There are many ways to optimize an implementation. The simplest and most straightforward is trial and error: make a variety of improvements based on some baseline knowledge, verify them in the target scenario, and then choose the best one using a set of evaluation criteria. All you need is experience, patience, and a little imagination. Although this approach can sometimes bring surprises, it is neither efficient nor reassuring.

Another common approach sounds more promising: first, invest the initial effort to fully understand the behavior of the source code (assembly code, if necessary) and to establish the theoretical optimal performance. With this optimal baseline, the performance gap can be confirmed. Runtime sampling is then conducted to analyze defects in the existing code and seek improvement. This may require a lot of experience and analysis effort. For example, the vhost enqueue optimization in DPDK 16.11 is the result of several weeks' work spent sampling and analyzing. Finally, by moving three lines of code, tests performed with DPDK testpmd showed that enqueue efficiency improved by 1.7 times, as the enqueue cost dropped from about 250 cycles per packet to about 150 cycles per packet. Later, in DPDK 17.02, the rte_memcpy optimization patch was derived from the same idea. Such results are hard to achieve with the first method.

See Appendix A for a description of the test hardware configuration we used for testing. To learn more about DPDK performance testing with testpmd, read Testing DPDK Performance and Features with TestPMD on Intel® Developer Zone.

There are many useful tools for profiling and sampling such as perf and VTune™. They are very effective as long as you know what data you’re looking for.

Show Me the Data!

Ultimately, the goal of optimization is to speed up the performance of the target application scenario, which is a combination of hardware, software, and data. The evaluation methods vary.

For memcpy, a micro-benchmark can easily produce a few key performance numbers such as the copy rate (MB/s); however, those numbers have limited reference value. That's because memcpy is normally optimized at the programming-language level as well as at the instruction level for a specific hardware platform, specific software code, and even specific data lengths, and the memcpy algorithm itself doesn't leave much room for improvement. Different scenarios therefore require different optimization techniques, and micro-benchmarks speak only for themselves.

Also, it is not advisable to evaluate performance by timestamping the memcpy code. Modern CPUs have very complex pipelines that support prefetching and out-of-order execution, which results in significant deviations when performance is measured at the cycle level. Although forced synchronization can be achieved by adding serializing instructions, doing so may change the execution sequence of the instruction flow, degrade program performance, and defeat the original intention of the measurement. Meanwhile, instructions that are heavily optimized by the compiler can also appear out of order with respect to the source code, and forcing sequential compilation significantly impacts performance and makes the result meaningless. Besides, the execution time of an instruction stream includes not only the ideal execution cycles, but also the data access latency caused by pipeline stall cycles. Since the data accessed by a piece of code has probably been, and will be, accessed by other parts of the program, that code may appear to execute faster or slower simply by advancing or delaying data accesses. These complex factors make the seemingly easy task of memcpy performance evaluation troublesome.
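To make the pitfall concrete, the following sketch shows the kind of naive cycle-level measurement being cautioned against (the buffer size and the use of the glibc memcpy are arbitrary choices for illustration). Without serializing instructions, the CPU may reorder work around the timestamp reads; adding them perturbs the very pipeline behavior being measured.

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <x86intrin.h>  /* __rdtsc() */

int main(void)
{
        static char src[4096], dst[4096];
        uint64_t start, end;

        /* Naive measurement: rdtsc is not a serializing instruction,
         * so the copy may overlap with the surrounding timestamp reads. */
        start = __rdtsc();
        memcpy(dst, src, sizeof(src));
        end = __rdtsc();

        printf("memcpy took ~%llu reference cycles\n",
               (unsigned long long)(end - start));
        return 0;
}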

Therefore, field testing should be used for optimization evaluation. For example, in the application of Open vSwitch* (OvS) in the cloud, memcpy is heavily used in vhost Rx/Tx, and in this case we should take the final packet forwarding rate as the performance evaluation criteria for the memcpy optimization.

Test Results

Figure 1 below shows test results for Physical-VM-Physical (PVP) traffic, which is close to the actual application scenario. Comparative data was gathered by replacing rte_memcpy in DPDK vhost with the memcpy provided by glibc. The results show that a 22 percent increase in total bandwidth can be obtained simply by accelerating the vhost Rx/Tx path with rte_memcpy. Our test configuration is described below in Appendix A.

Figure 1. Performance comparison between DPDK rte_memcpy and glibc memcpy in OvS-DPDK

 

Continue the Conversation

Join the DPDK mailing list, dev@dpdk.org, where your feedback and questions about rte_memcpy are welcomed.

About the Author

Zhihong Wang is a software engineer at Intel. He has worked in various areas, including CPU performance benchmarking and analysis, packet processing performance optimization, and network virtualization.

Appendix A

Test Environment

  • PVP flow: Ixia* sends packets to the physical network card; OvS-DPDK forwards packets received on the physical network card to the virtual machine; the virtual machine processes the packets and sends them back through OvS-DPDK to the physical network card, and finally back to Ixia
  • The virtual machine performs MAC forwarding using DPDK testpmd
  • OvS-DPDK Version: Commit f56f0b73b67226a18f97be2198c0952dad534f1c
  • DPDK Version: 17.02
  • GCC/GLIBC Version: 6.2.1/2.23
  • Linux*: 4.7.5-200.fc24.x86_64
  • CPU: Intel® Xeon® processor E5-2699 v3 at 2.30GHz

OvS-DPDK Compile and Boot Commands

./ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
./ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem="1024,1024"
./ovs-vsctl add-br ovsbr0 -- set bridge ovsbr0 datapath_type=netdev
./ovs-vsctl add-port ovsbr0 vhost-user1 -- set Interface vhost-user1 type=dpdkvhostuser
./ovs-vsctl add-port ovsbr0 dpdk0 -- set Interface dpdk0 type=dpdk options:dpdk-devargs=0000:06:00.0
./ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10000
./ovs-ofctl del-flows ovsbr0
./ovs-ofctl add-flow ovsbr0 in_port=1,action=output:2
./ovs-ofctl add-flow ovsbr0 in_port=2,action=output:1

Use DPDK testpmd for Virtual Machine Forwarding

set fwd mac
start


Configure SR-IOV Network Virtual Functions in Linux* KVM*


Introduction

This tutorial demonstrates several different ways of using single root input/output virtualization (SR-IOV) network virtual functions (VFs) in Linux* KVM* virtual machines (VMs) and discusses the pros and cons of each method.

Here’s the short story: use the KVM virtual network pool of SR-IOV adapters method. It has the same performance as the VF PCI* passthrough method, but it’s much easier to set up. If you must use the macvtap method, use virtio as your device model because every other option will give you horrible performance. And finally, if you are using a 40 Gbps Intel® Ethernet Server Adapter XL710, consider using the Data Plane Development Kit (DPDK) in the guest; otherwise you won’t be able to take full advantage of the 40 Gbps connection.

There are a few downloads associated with this tutorial that you can get from github.com/intel (see the Resources section at the end of this article).

SR-IOV Basics

SR-IOV provides the ability to partition a single physical PCI resource into virtual PCI functions which can then be injected into a VM. In the case of network VFs, SR-IOV improves north-south network performance (that is, traffic with endpoints outside the host machine) by allowing traffic to bypass the host machine’s network stack. 

Supported Intel Network Interface Cards

A complete list of Intel Ethernet Server Adapters and Intel® Ethernet Controllers that support SR-IOV is available online, but in this tutorial, I evaluated just four: 

  • The Intel® Ethernet Server Adapter X710, which supports up to 32 VFs per port
  • The Intel Ethernet Server Adapter XL710, which supports up to 64 VFs per port 
  • The Intel® Ethernet Controller X540-AT2, which supports 32 VFs per port 
  • The Intel® Ethernet Controller 10 Gigabit 82599EB, which supports 32 VFs per port

Assumptions

There are several different ways to inject an SR-IOV network VF into a Linux KVM VM. This tutorial evaluates three of those ways:

  • As an SR-IOV VF PCI passthrough device
  • As an SR-IOV VF network adapter using macvtap 
  • As an SR-IOV VF network adapter using a KVM virtual network pool of adapters

Most of the steps in this tutorial can be done using either the command line virsh tool or using the virt-manager GUI. If you prefer to use the GUI, you’ll find screenshots to guide you; if you are partial to the command line, you’ll find code and XML snippets to help. Note that there are several steps in this tutorial that cannot be done via the GUI. 

Network Configuration

The test setup included two physical servers—net2s22c05 and net2s18c03—and one VM—sr-iov-vf-testvm—that was hosted on net2s22c05. Net2s22C05 had one each of the four Intel Ethernet Server Adapters listed above with one port in each adapter directly linked to a NIC port with equivalent link speed in net2s18c03. The NIC ports on each system were in the same subnet: those on net2s18c03 all had static IP addresses with .1 as the final dotted quad, the net2s22c05 ports had .2 as the final dotted quad, and the virtual ports in sr-iov-vf-testvm all had .3 as the final dotted quad:  


System Configuration

Host Configuration

  • CPU: 2-socket, 22-core Intel® Xeon® processor E5-2699 v4 @ 2.20 GHz
  • Memory: 128 GB
  • NICs:
      Intel® Ethernet Controller X540-AT2
      Intel® 82599 10 Gigabit TN Network Connection
      Intel® Ethernet Controller X710 for 10GbE SFP+
      Intel® Ethernet Controller XL710 for 40GbE QSFP+
  • Operating System: Ubuntu* 16.04 LTS
  • Kernel parameters: GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt"

Guest Configuration

The following XML snippets are taken from the output of # virsh dumpxml sr-iov-vf-testvm.

  • CPU: <vcpu placement='static'>8</vcpu><cpu mode='host-passthrough'><topology sockets='1' cores='8' threads='1'/></cpu>
  • Memory: <memory unit='KiB'>12582912</memory><currentMemory unit='KiB'>12582912</currentMemory>
  • NIC: <interface type='network'><mac address='52:54:00:4d:2a:82'/><source network='default'/><model type='rtl8139'/><address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/></interface>
    (The SR-IOV NIC XML tag varied based on the configurations discussed in this tutorial.)
  • Operating System: Ubuntu 14.04 LTS. Note: This OS and Linux* kernel version were chosen based on a specific usage; otherwise, newer versions would have been used.
  • Linux* Kernel Version: 3.13.0-24-lowlatency
  • Software: ufw purged, lshw installed

Note: Ubuntu 14.04 LTS did not come with the i40evf driver preinstalled. I built the driver from source and then loaded it into the kernel. I used version 2.0.22. Instructions for building and loading the driver are located in the README file.

The complete KVM definition file is available online.

Scope

This tutorial does not focus on performance. And even though the performance of the Intel Ethernet Server Adapter XL710 SR-IOV connection listed below clearly demonstrates the value of the DPDK, this tutorial does not focus on configuring SR-IOV VF network adapters to use DPDK in the guest VM environment. For more information on this topic, see the Single Root IO Virtualization and Open vSwitch Hands-On Lab Tutorials. You can find detailed instructions on how to set up SR-IOV VFs on the host in the SR-IOV Configuration Guide and the video Creating Virtual Functions using SR-IOV. But to get you started: once you have enabled iommu=pt and intel_iommu=on as kernel boot parameters, and provided you are running a Linux kernel of at least version 3.8.x, initialize SR-IOV VFs by issuing the following command:

     # echo 4 > /sys/class/net/<device name>/device/sriov_numvfs

Once an SR-IOV NIC VF is created on the host, the driver/OS assigns a MAC address and creates a network interface for the VF adapter.
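For example, the following commands (a sketch; substitute your own device name) confirm how many VFs exist, list the VF PCI devices, and show the MAC addresses the driver assigned to each VF:

     # cat /sys/class/net/<device name>/device/sriov_numvfs
     # lspci | grep -i "Virtual Function"
     # ip link show <device name>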

Parameters

When evaluating the advantages and disadvantages of each insertion method, I looked at the following:

  • Host device model
  • PCI device information as reported in the VM
  • Link speed as reported by the VM
  • Host driver
  • Guest driver 
  • Simple performance characteristics using iperf
  • Ease of setup

Host Device Model

This is the device type that is specified when the SR-IOV network adapter is inserted into the VM. In the virt-manager GUI, the following typical options are available:

  • Hypervisor default (which in our configuration defaulted to rtl8139)
  • rtl8139
  • e1000
  • virtio

Additional options were available on our test host machine, but they had to be entered into the VM XML definition using # virsh edit. I additionally evaluated the following:

  • ixgbe
  • i82559er

VM Link Speed

I evaluated link speed of the SR-IOV VF network adapter in the VM using the following command:

     # ethtool eth1 | grep Speed


Host Network Driver

This is the driver that the KVM Virtual Machine Manager (VMM) uses for the NIC as displayed in the <driver> XML tag when I ran the following command on the host after starting the VM: 

     # virsh dumpxml sr-iov-vf-testvm | grep -w hostdev -A9


Guest Network Driver

This is the driver that the VM uses for the NIC. I found the information by first determining the SR-IOV NIC PCI interface information in the VM:

     # lshw -c network -businfo


Using this PCI bus information, I then ran the following command to find what driver the VM had loaded into the kernel for the SR-IOV NIC:

     # lspci -vmmks 00:03.0


Performance 

Because this is not a performance-oriented paper, this data is provided only to give a rough idea of the performance of the different configurations. The command I ran on the server system was:

     # iperf -s -f m


And the client command was: 

     # iperf -c <server ip address> -f m -P 2


I only did one run with the test VM as the server and one run with the test VM as a client.

Ease of Setup

This is an admittedly subjective evaluation parameter. But I think you’ll agree that there was a clear loser: the option of inserting the SR-IOV VF as a PCI passthrough device.

SR-IOV Virtual Function PCI Passthrough Device 

The most basic way to connect an SR-IOV VF to a KVM VM is by directly importing the VF as a PCI device using the PCI bus information that the host OS assigned to it when it was created. 

Using the Command Line

Once the VF has been created, the network adapter driver automatically creates the infrastructure necessary to use it. 

Step 1: Find the VF PCI bus information. 

In order to find the PCI bus information for the VF, you need to know how to identify it, and sometimes the interface name that is assigned to the VF seems arbitrary. For example, in the following figure there are two VFs and the PCI bus information is outlined in red, but it is impossible to determine from this information which physical port the VFs are associated with.

     # lshw -c network -businfo

Find the VF PCI bus information.

The following bash script lists all the VFs associated with a physical function.

#!/bin/bash

NIC_DIR="/sys/class/net"
for i in $( ls $NIC_DIR) ;
do
	if [ -d "${NIC_DIR}/$i/device" -a ! -L "${NIC_DIR}/$i/device/physfn" ]; then
		declare -a VF_PCI_BDF
		declare -a VF_INTERFACE
		k=0
		for j in $( ls "${NIC_DIR}/$i/device" ) ;
		do
			if [[ "$j" == "virtfn"* ]]; then
				VF_PCI=$( readlink "${NIC_DIR}/$i/device/$j" | cut -d '/' -f2 )
				VF_PCI_BDF[$k]=$VF_PCI
				#get the interface name for the VF at this PCI Address
				for iface in $( ls $NIC_DIR );
				do
					link_dir=$( readlink ${NIC_DIR}/$iface )
					if [[ "$link_dir" == *"$VF_PCI"* ]]; then
						VF_INTERFACE[$k]=$iface
					fi
				done
				((k++))
			fi
		done
		NUM_VFs=${#VF_PCI_BDF[@]}
		if [[ $NUM_VFs -gt 0 ]]; then
			#get the PF Device Description
			PF_PCI=$( readlink "${NIC_DIR}/$i/device" | cut -d '/' -f4 )
			PF_VENDOR=$( lspci -vmmks $PF_PCI | grep ^Vendor | cut -f2)
			PF_NAME=$( lspci -vmmks $PF_PCI | grep ^Device | cut -f2 )
			echo "Virtual Functions on $PF_VENDOR $PF_NAME ($i):"
			echo -e "PCI BDF\t\tInterface"
			echo -e "=======\t\t========="
			for (( l = 0; l < $NUM_VFs; l++ )) ;
			do
				echo -e "${VF_PCI_BDF[$l]}\t${VF_INTERFACE[$l]}"
			done
			unset VF_PCI_BDF
			unset VF_INTERFACE
			echo ""
		fi
	fi
done

With the PCI bus information from this script, I imported a VF from the first port on my Intel Ethernet Controller X540-AT2 as a PCI passthrough device.

PCI passthrough device

Step 2: Add a hostdev tag to the VM.

Using the command line, use # virsh edit <VM name> to add a hostdev XML tag to the machine. Use the host machine PCI domain, bus, slot, and function information from the bash script above for the source tag’s address attributes.

# virsh edit <name of virtual machine>
# virsh dumpxml <name of virtual machine>
<domain>
…
<devices>
…
<hostdev mode='subsystem' type='pci' managed='yes'><source><address domain='0x0000' bus='0x03' slot='0x10' function='0x0'/></source></hostdev>
…
</devices>
…
</domain>

Once you exit the virsh edit command, KVM automatically adds an additional <address> tag to the hostdev tag to allocate the PCI bus address in the VM.
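For illustration only (the guest-side slot value below is hypothetical; KVM chooses it), the resulting hostdev element then looks roughly like this:

<hostdev mode='subsystem' type='pci' managed='yes'><source><address domain='0x0000' bus='0x03' slot='0x10' function='0x0'/></source><address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/></hostdev>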

Step 3: Start the VM.

     # virsh start <name of virtual machine>

Start the VM.

Using the GUI

Note: I have not found an elegant way to discover the SR-IOV PCI bus information using graphical tools. 

Step 1: Find the VF PCI bus information.

See the commands from Step 1 above. 

Step 2: Add a PCI host device to the VM.

Once you have the host PCI bus information for the VF, using the virt-manager GUI, click Add Hardware.

 Add a PCI host device to the VM.

After selecting PCI Host Device, you’ll see an array of PCI devices shown that can be imported into our VM.

 Add a PCI host device to the VM.

Give keyboard focus to the Host Device drop-down list, and then start typing the PCI Bus Device Function information from the bash script above, substituting a colon for the period (‘03:10:0’ in this case). After the desired VF comes into focus, click Finish.

 Add a PCI host device to the VM.

The PCI device just imported now shows up in the VM list of devices.

Add a PCI host device to the VM.

Step 3: Start the VM.

Add a PCI host device to the VM.

Summary

When using this method of directly inserting the PCI host device into the VM, there is no ability to change the host device model: for all NIC models, the host used the vfio driver. The Intel Ethernet Server Adapter XL710 and X710 adapters used the i40evf driver in the guest, and for both, the VM PCI device information reported the adapter name as “XL710/X710 Virtual Function.” The Intel Ethernet Controller X540-AT2 and Intel 82599 10 Gigabit Ethernet Controller adapters used the ixgbevf driver in the guest, and the VM PCI device information reported “X540 Ethernet Controller Virtual Function” and “82599 Ethernet Controller Virtual Function,” respectively. With the exception of the XL710, which showed a link speed of 40 Gbps, all of the 10 Gb adapters showed a link speed of 10 Gbps. For the X540, 82599, and X710 adapters, the iperf test ran at nearly line rate (~9.4 Gbps), and performance was roughly 8 percent worse when the VM was the iperf server than when the VM was the iperf client. While the XL710 performed better than the 10 Gb NICs, it ran at roughly 70 percent of line rate when the iperf server ran on the VM, and at roughly 40 percent of line rate when the iperf client was on the VM. This disparity is most likely due to the kernel being overwhelmed by the high rate of I/O interrupts, a problem that would be solved by using DPDK.

The one advantage to this method is that it allows control over which VF is inserted into the VM, whereas the virtual network pool of adapters method does not. This method of injecting an SR-IOV VF network adapter into a KVM VM is the most complex to set up and provides the fewest host device model options. Performance is not significantly different than the method that involves a KVM virtual network pool of adapters. However, that method is much simpler to use. Unless you need control over which VF is inserted into your VM, I don’t recommend using this method.

SR-IOV Network Adapter Macvtap 

The next way to add an SR-IOV network adapter to a KVM VM is as a VF network adapter connected to a macvtap on the host. Unlike the previous method, this method does not require you to know the PCI bus information for the VF, but you do need to know the name of the interface that the OS created for the VF when it was created.

Using the Command Line

Much of this method of connecting an SR-IOV VF to a VM can be done via the virt-manager GUI, but step 1 must be done using the command line.

Step 1: Determine the VF interface name 

As shown in the following figure, after creating the VF, use the bash script listed above to display the network interface names and PCI bus information assigned to the VFs.

Determine the VF interface name

With this information, insert the VFs into your KVM VM using either the virt-manager GUI or the virsh command line.

Step 2:  Add an interface tag to the VM.

To use the command-line with the macvtap adapter solution, with the VM shut off, edit the VM configuration file and add an ‘interface’ tag with sub-elements and attributes shown below. The interface ‘type’ is ‘direct’, and the ‘dev’ attribute of the ‘source’ sub-element must point to the interface name that the host OS assigned to the target VF. Be sure to specify the ‘mode’ attribute of the ‘source’ element as ‘passthrough’:

# virsh edit <name of virtual machine>
# virsh dumpxml <name of virtual machine>
<domain>
…
<devices>
…
   <interface type='direct'><source dev='enp3s16f1' mode='passthrough'/></interface>
…
</devices>
…
</domain>

Once the editor is closed, KVM automatically assigns a MAC address to the SR-IOV interface, uses the default model type value of rtl8139, and assigns the NIC a slot on the VM’s PCI bus. 
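For illustration only (the MAC address and guest PCI slot below are hypothetical; KVM assigns them), the completed interface element as shown by virsh dumpxml then looks roughly like this:

<interface type='direct'><mac address='52:54:00:xx:xx:xx'/><source dev='enp3s16f1' mode='passthrough'/><model type='rtl8139'/><address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/></interface>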

Step 3: Start the VM.

 Start the VM.

As the VM starts, KVM creates a macvtap adapter (macvtap0 in this case) on the specified VF. On the host, you can see that the macvtap adapter that KVM created for your VF NIC uses a MAC address that is different from the MAC address on the other end of the macvtap in the VM:

     # ip l | grep enp3s16f1 -A1

 Start the VM.

The fact that there are two MAC addresses assigned to the same VF—one by the host OS and one by the VM—suggests that the network stack using this configuration is more complex and likely slower.

Using the GUI

With the exception of determining the interface name of the desired VF, all the steps of this method can be done using the virt-manager GUI.

Step 1: Determine the VF interface name.

See the command line Step 1 above.

Step 2: Add the SR-IOV macvtap adapter to the VM.

Using virt-manager, add hardware to the VM.

 Add the SR-IOV macvtap adapter to the VM.

Select Network as the type of device.

 Add the SR-IOV macvtap adapter to the VM.

For the Network source, choose the Host device <interface name>: macvtap line from the drop-down control, substituting for “interface name” the interface that the OS assigned to the VF created earlier.  

 Add the SR-IOV macvtap adapter to the VM.

Note virt-manager’s warning about communication with the host using macvtap VFs.

 Add the SR-IOV macvtap adapter to the VM.

Ignore this warning and choose Passthrough in the Source mode drop-down control.

 Add the SR-IOV macvtap adapter to the VM.

Note that virt-manager assigns a MAC address to the macvtap VF that is NOT the same address as the host OS assigned to the SR-IOV VF.

 Add the SR-IOV macvtap adapter to the VM.

Finally, click Finish.

Step 3: Start the VM.

 Add the SR-IOV macvtap adapter to the VM.

Summary

When using the macvtap method of connecting an SR-IOV VF to a VM, the host device model had a dramatic effect on all parameters, and there was no host driver information listed regardless of configuration. Unlike the other two methods, it was impossible to tell from the VM PCI device information which model of VF was underneath. Like the direct VF PCI passthrough insertion option, this method allows you to control which VF you wish to use. Regardless of which VF was connected, when the host device model was rtl8139 (the hypervisor default in this case), the guest driver was 8139cp, the link speed was 100 Mbps, and performance was roughly 850 Mbps. When e1000 was selected as the host device model, the guest driver was e1000, the link speed was 1 Gbps, and iperf ran at 2.1 Gbps with the client on the VM and 3.9 Gbps with the client on the other server. When the VM XML file was edited so that ixgbe was the host device model, the VM failed to boot. When the host device model tag in the VM XML was set to i82559er, the guest VM used the e100 driver for the VF, link speed was 100 Mbps, and iperf ran at 800 Mbps when the server was on the VM and 10 Mbps when the client was on the VM. Selecting virtio as the host device model clearly provided the best performance: no link speed was listed in that configuration, the VM used the virtio-pci driver, and iperf performance was roughly line rate for the 10 Gbps adapters. When the Intel Ethernet Server Adapter XL710 VF was inserted into the VM using macvtap, performance with the client on the VM was ~40 percent of line rate, similar to the other insertion methods; however, performance with the server on the VM was significantly worse than with the other insertion methods: ~40 percent of line rate versus ~70 percent.

The method of inserting an SR-IOV VF network device into a KVM VM via a macvtap is simpler to set up than the option of directly importing the VF as a PCI device. However, the connection performance varies by a factor of 100 depending on which host device model is selected. In fact, the default device model for both the command line and the GUI is rtl8139, which performs 10x slower than virtio, the best option. And if the i82559er host device model is specified in the KVM XML file, performance is 100x worse than with virtio. If virtio is selected, the performance is similar to that of the other methods of inserting the SR-IOV VF NIC mentioned here. If you must use this method of connecting the VF to a VM, be sure to use virtio as the host device model.

SR-IOV Virtual Network Adapter Pool 

The final method of using an SR-IOV VF NIC with KVM involves creating a virtual network based on the NIC PCI physical function. You don’t need to know PCI information as was the case with the first method, or VF interface names as was the case with the second method. All you need is the interface name of the physical function. Using this method, KVM creates a pool of network devices that can be inserted into VMs, and the size of that pool is determined by how many VFs were created on the physical function when they were initialized.

Using the Command Line

Step 1: Create the SR-IOV virtual network pool. 

Once the SR-IOV VFs have been created, use them to create a virtual network pool of network adapters. List physical network adapters that have VFs defined. You can identify them with the lines that begin ‘vf’:

     # ip l

 Create the SR-IOV virtual network pool.

Make an XML file (sr-iov-net-XL710.xml in the code snippet below) that contains an XML element using the following template, and then substitute for ‘ens802f0’ the interface name of the physical function used to create your VFs and a name of your choosing for ‘sr-iov-net-40G-XL710’:

# cat > sr-iov-net-XL710.xml << EOF
<network>
  <name>sr-iov-net-40G-XL710</name>
  <forward mode='hostdev' managed='yes'>
    <pf dev='ens802f0'/>
  </forward>
</network>
EOF

Once this XML file has been created, use it with virsh to create a virtual network:

     # virsh net-define sr-iov-net-XL710.xml

Step 2: Display all virtual networks.

To make sure the network was created, use the following command:

# virsh net-list --all
 Name                 State      Autostart     Persistent
----------------------------------------------------------
 default              active     yes           yes
 sr-iov-net-40G-XL710 inactive   no            yes

Step 3: Start the virtual network.

The following command instructs KVM to start the network just created. Note that the name of the network (sr-iov-net-40G-XL710) comes from the name XML tag in the snippet above.

     # virsh net-start sr-iov-net-40G-XL710

Step 4: Autostart the virtual network.

If you want to have the network automatically start when the host machine boots, make sure that the VFs get created at boot, and then:

     # virsh net-autostart sr-iov-net-40G-XL710

Step 5: Insert a NIC from the VF pool into the VM.

Once this SR-IOV VF network has been defined and started, insert an adapter on that network into the VM while it is stopped. Use virsh edit to add a network adapter XML tag to the machine that has the name of the virtual network as its source network, remembering to substitute the name of your SR-IOV virtual network for the ‘sr-iov-net-40G-XL710’ label.

# virsh edit <name of virtual machine>
# virsh dumpxml <name of virtual machine>
<domain>
…
<devices>
…
<interface type='network'><source network='sr-iov-net-40G-XL710'/></interface>
…
</devices>
…
</domain>

Step 6: Start the VM.

     # virsh start <name of virtual machine>

 Start the VM.

Using the GUI

Step 1: Create the SR-IOV virtual network pool.

I haven’t been able to find a good way to create an SR-IOV virtual network pool using the virt-manager GUI because the only forward mode options in the GUI are “NAT” and “Routed.” The required forward mode of “hostdev” is not an option in the GUI. See Step 1 above.

Step 2: Display all virtual networks.

Using the virt-manager GUI, edit the VM connection details to view the virtual networks on the host.

 Display all virtual networks.

The virtual network created in step 1 appears in the list.

 Display all virtual networks.

Step 3: Start the virtual network.

To start the network, select it on the left, and then click the green “play” icon.

 Start the virtual network.

Step 4: Autostart the virtual network.

To autostart the network when the host machine boots, select the Autostart box so that the text changes from Never to On Boot. (Note: this will fail if you don’t also automatically allocate the SR-IOV VFs at boot.)

 Autostart the virtual network.

Step 5: Insert a NIC from the VF pool into the VM.

Open the VM.

 Insert a NIC from the VF pool into the VM.

Click the information button (“I”) icon, and then click Add Hardware.

 Insert a NIC from the VF pool into the VM.

On the left side, click Network to add a network adapter to the VM.

 Insert a NIC from the VF pool into the VM.

Then select Virtual network ‘<your virtual network name>’: Hostdev network as the Network source, allow virt-manager to select a MAC address, and leave the Device model as Hypervisor default.

 Insert a NIC from the VF pool into the VM.

Click Finish. The new NIC appears in the list of VM hardware with “rtl8139” as the device model.

 Insert a NIC from the VF pool into the VM.

Step 6: Start the VM.

 Start the VM.

Summary

When using the network pool of SR-IOV VFs, selecting different host device models when inserting the NIC into the VM made no difference as far as iperf performance, guest driver, VM link speed, or host driver were concerned. In all cases, the host used the vfio driver. The Intel Ethernet Server Adapter XL710 and X710 used the i40evf driver in the guest, and for both, the VM PCI device information reported the adapter name as “XL710/X710 Virtual Function.” The Intel Ethernet Controller X540-AT2 and the Intel Ethernet Controller 10 Gigabit 82599EB used the ixgbevf driver in the guest, and the VM PCI device information reported “X540 Ethernet Controller Virtual Function” and “82599 Ethernet Controller Virtual Function,” respectively. With the exception of the Intel Ethernet Server Adapter XL710, which showed a link speed of 40 Gbps, all of the 10 Gb adapters showed a link speed of 10 Gbps. For the Intel Ethernet Controller X540, Intel Ethernet Controller 10 Gigabit 82599, and Intel Ethernet Server Adapter X710, the iperf test ran at nearly line rate (~9.4 Gbps), and performance was roughly 8 percent worse when the VM was the iperf server than when the VM was the iperf client. While the Intel Ethernet Server Adapter XL710 performed better than the 10 Gb NICs, it ran at roughly 70 percent of line rate when the iperf server ran on the VM, and at roughly 40 percent of line rate when the iperf client was on the VM. This disparity is most likely due to the kernel being overwhelmed by the high rate of I/O interrupts, a problem that would be solved by using the DPDK.

In my opinion, this method of using the SR-IOV NIC is the easiest to set up, because the only information needed is the interface name of the NIC physical function—no PCI information and no VF interface names. And with default settings, the performance was equivalent to the VF PCI passthrough option. The primary disadvantage of this method is that you cannot select which VF you wish to insert into the VM because KVM manages it automatically, whereas with the other two insertion options you can select which VF to use. So unless this ability to select which VF to use is a requirement for you, this is clearly the best method.

Additional Findings

In every configuration, the test VM was able to communicate with both the host and the external traffic generator, and the VM was able to continue communicating with the external traffic generator even when the host PF had no IP address assigned, as long as the PF link state on the host remained up. Additionally, I found that when all four VFs were inserted into the VM simultaneously using the virtual network adapter pool method and iperf ran simultaneously on all four network connections, each connection still maintained the same performance as if run separately.

Conclusion

Using SR-IOV network adapter VFs in a VM can accelerate north-south network performance (that is, traffic with endpoints outside the host machine) by allowing traffic to bypass the host machine’s network stack. There are several ways to insert an SR-IOV NIC into a KVM VM using the command line and virt-manager, but using a virtual network pool of SR-IOV VFs is the simplest to set up and provides performance that is as good as the other methods. If you need to be able to select which VF to insert into the VM, the VF PCI passthrough option will likely be best for you. And if you must use the macvtap method, be sure to select ‘virtio’ as your host device type; otherwise your performance will be very poor. Additionally, if you are using an Intel Ethernet Server Adapter XL710, consider using DPDK in the VM in order to take full advantage of the SR-IOV adapter’s speed.

About the Author

Clayne Robison is a Platform Application Engineer at Intel, working with software companies in Software Defined Networks and Network Function Virtualization. He lives in Mesa, Arizona, USA with his wife and the six of his eight children still at home. He’s a foodie that enjoys travelling around the world, always with his wife, and sometimes with the kids. When he has time to himself (which isn’t very often), he enjoys gardening and tending the citrus trees in his yard.

Resources

SR-IOV Configuration Guide: http://www.intel.com/content/dam/www/public/us/en/documents/technology-briefs/xl710-sr-iov-config-guide-gbe-linux-brief.pdf

Creating Virtual Functions Using SR-IOV: http://software.intel.com/en-us/videos/creating-virtual-functions-using-sr-iov

FAQ for Intel® Ethernet Server Adapters with SR-IOV: http://www.intel.com/content/www/us/en/support/network-and-i-o/ethernet-products/000005722.html 

SR-IOV for NFV Solutions: http://www.intel.com/content/dam/www/public/us/en/documents/technology-briefs/sr-iov-nfv-tech-brief.pdf 

SDN-NFV-Hands-on-Samples: https://github.com/intel/SDN-NFV-Hands-on-Samples 

Trion Worlds: Moving With the Times


The original article is published by Intel Game Dev on VentureBeat*: Trion Worlds: Moving with the times. Get more game dev news and related topics from Intel on VentureBeat.

 Colorful, futuristic, 3d gaming plaza, populated with other colorful gaming characters.

In the big picture of global industry, ten years may not sound like a long time, but in the world of video games—especially online games—it’s a lifetime. Certainly, the ten years since Trion Worlds* was founded have seen significant shifts in the tastes of gamers, resulting in the company evolving its own game designs and styles to match emerging trends.

This evolution is reflected to a degree in the journey taken by CEO Scott Hartsman at the Redwood City, CA-based company. After joining initially as Executive Producer of Rift and overseeing the successful launch of the massively multiplayer online (MMO) game, he returned four years ago, as CEO. But Hartsman’s experience in the online PC game space goes back about as far as it’s possible to go in the industry, with stints designing and running games at early online companies such as ENGAGE games online*, Simutronics*, and then at Sony Online Entertainment* with its genre-changing EverQuest and EverQuest II.

It’s this deep experience throughout the fascinating evolution of online gaming that perfectly positions Hartsman to navigate rapidly changing gamer tastes. Casting back to 2007, Trion Worlds was founded with the goal of creating a technology platform on which massive online worlds could be built. At the same time, the company was also developing a game—Rift—to showcase the technology and provide its own shared world experience.

“Things people were aspiring to create were dramatically different from what people are trying to build today,” says Hartsman. “That’s due to changes in technology, people’s tastes, and business models. I guess you could call it Trion 2.0 at this point.”

The shift has seen Trion focus on games or, more specifically, on the tastes and interests of gamers. “We weren’t an engine shop for developers, and because our intended customers were gamers we had to start acting like a game company. So we started focusing more on the games we were creating and less on their underlying core tech,” says Hartsman.

Creating a rift

Rift went on to be a big success in the MMO game space, which reinforced the lean toward game development over technology development. But even then, trends in online gaming showed rapid signs of shifting in a direction different from the formulae that had underpinned the design and creation of Rift.

“Going back to 2010 and 2011, gamers were exploring massive worlds for an average of four hours a day, which was very similar to the EverQuest and World of Warcraft* era,” says Hartsman.

 An animal that looks like an armored triceratops, rhinoceros mix is ridden by a warrior through a semi wild landscape.

Above: Rift required several years of development, but went on to be a successful massively multiplayer online game.

Hartsman identifies core adjustments not just in gamer preferences, but also in fundamental accessibility to the Internet and interest in games. “The population of the Internet essentially has evolved to be everyone,” he says. “But tastes change and people are looking for different experiences.”

Part of the motivation of spending those daily hours in online worlds was the social connection made between gamers. Since interest in online gaming wasn’t universal, a willingness to play every day, learn deep game systems, and share experiences was limited, to a degree, to those that ‘got it.’

“Now, people bring their friends with them when they play games. Every game is an online game, and it’s not about coming in to make new friends,” he says. The shift in social dynamic has directly impacted the kind of game experiences embraced by a much wider audience.

“Gamers now play games in five- to 20-minute chunks that they can play over and over again. Look at Hearthstone, League of Legends, Dota, and Warframe, games that have the same depth of engagement but without massive synchronous worlds,” says Hartsman.

Reacting to the world

The change in tastes has led to Trion Worlds bringing its own competitive game to market in the form of Atlas Reactor. Emerging out of its own internal Trion Labs—where 35 initial one-page pitches were whittled down to 15 treatments and then down to three prototypes—a passionate team crafted its unique take on the crowded competitive gaming space.

 Colorful, futuristic, 3d characters engaged in battle

Above: “Constant iteration of builds is vital to producing the best possible game,” says Trion CEO Scott Hartsman.

“It’s more strategic than fast-action, and it’s set in an arena, but we didn’t want to be entry number eight in the lane-based MOBA (multiplayer online-battle arena) market. If you’re not in the top three, you might as well not exist,” says Hartsman on the importance of differentiating. For Atlas Reactor, it was important to find a niche to own and so, rather than challenge the fast-action lane games, the team crafted a simultaneous turn-based formula for its four-on-four competition. “It’s entertaining in the way Texas Hold ‘Em or American Football are entertaining,” says Hartsman. “You get a plan in your head and then everything happens at once as the plan plays out.”

For Trion Worlds, it’s refreshing to develop a game in a tighter, more focused environment given past experience building massive synchronous online worlds. “When you make an open world game, you can’t get a sense on whether it’s fun or cohesive until very late in the project,” says Hartsman. “You never get the vertical slice to see if it’s fun because it takes four years to get to that stage.”

“It’s bringing game development a little out of the wild west—like it was a decade ago—to something with more predictability and sanity,” says Hartsman, adding that the opportunity to constrain the scope of the project has the added benefit of “increasing the chance of actually shipping the product.”

Another key off-shoot of developing along a narrower vertical slice of gameplay is the ability to iterate constantly. Hartsman is keenly aware that despite tremendous hype, an open world game could launch and simply not function. “I have a difficult time being hard on developers in those situations,” he explains. “Because they are so complex, in many ways it’s a miracle that they work at all.”

Iterate, iterate, iterate

“The number of iterations you do on any given thing that you ship is directly proportional to the final quality,” asserts Hartsman. “It’s not all about time, resources, money, and head count … it’s all about the number of iterations, and the number of meaningful ones that you can do, and as soon as you can do them, the better off you’re going to be,” he adds.

 Colorful, futuristic, 3d characters engaged in battle

Above: Owning a niche was vital to Trion Worlds as the company moved into the competitive gaming space.

Despite a somewhat easier development process with iterative gameplay testing, the marketplace challenge is heightened due to the volume of competition and evolving audience expectations. “Quality goes up and player expectation with it,” says Hartsman of the current competitive environment, “but that’s what we want to focus on, improving quality rather than making wider and wider landscapes.”

As the development pendulum hovers around the shorter experiences that have fueled the massive growth in MOBAs and mobile gaming, Hartsman relishes the opportunities still out there should it swing back to open, shared experiences. “I look forward to that again, it’s a personal passion. As a gamer, I enjoy playing those games, but the fact that we’re making these multi-strained games helps us streamline the quality,” he says.

Whatever direction the gaming industry wind is blowing, the lesson—or the message—is to stay nimble, and stay attuned to the changes in what the gamer wants to play. “People sticking around the last decade are in a constant state of reinvention,” says Hartsman. The only real certainty is that by the time your game is finished, another trend will have shifted the direction of consumer interest. Stay nimble.

Blowfish Bidding for the Big Time


The original article is published by Intel Game Dev on VentureBeat*: Blowfish bidding for the big time. Get more game dev news and related topics from Intel on VentureBeat.

 Mech-Man, shooting away at opponents with guns blazing

The challenges facing fledgling studios stepping out in the competitive wide world of game development can be daunting when they’re known, terrifying when they are unseen, and plain baffling when the rules are changed on the fly.

For Blowfish Studios, founded by Ben Lee and Aaron Grove in 2010 in Sydney, Australia, witnessing the opening and closing of several studios led to defining their ambition as establishing a “sustainable” indie game studio. As many developers can attest, that is easier said than successfully accomplished.

“You can make quite a lot of money making mobile games and free-to-play, but that’s not really what we wanted to do,” Lee says. “We play all kinds of games, and while we’re 100 percent committed to keeping a sustainable balance between work-for-hire and making our own games, we’re really here to do the latter.”

The studio’s rookie project, Siegecraft*, laid the groundwork as a featured game release with the iPhone* 4S and 5, and that awareness ultimately drove it to become a top gaming app on the iPad platform. Of course, for any studio starting out, following the core passion is vital to maintaining commitment to and enthusiasm for games that require so much mental investment, so shifting to PC and beginning work on Gunscape was an important step.

Overseeing their own destiny, the Blowfish team could now focus their talents on building the game that would help boost the studio’s profile and that would make a statement in the competitive PC shooter market.

Landscape Gunscape

The team’s commitment to creating their chart-topping iOS title paved the way for the development of Gunscape, a shooter game that blends classic FPS settings and enemies with more recent (but retro-styled) block-based building mechanics. On the face of it, it makes perfect sense: bring Doom to the Minecraft crowd; let the audience be the creators; let everyone share anything they create.

Screenshot of split screen FPS shooter action in a game world

Above: Split-screen action helps define Gunscape’s social gameplay.

Blowfish crafted the tools and the means to share user-generated content, as well as its own levels that acted as examples of how the game could play out along its blocky, stylized path. The intention (and hope) was that users would embrace the user-friendly tools to craft numerous shooter experiences. Oftentimes in these situations, it works out that 1 percent of the total audience creates the content that the remaining 99 percent consume. Not so with Gunscape as Lee suggests that “most of them” have been involved in map creation to some extent.

Collaborative projects have also helped the community overcome weaknesses in the supplied map design mechanics. Lee freely admits that the omission of timers attached to enemies or map events “was probably an oversight on our part.” But the community solved the shortcoming themselves. You can set monster spawns and traps such as darts, and although it was never specified in any documentation, darts deal a set amount of damage. Map creators figured out this amount so that they could trigger a trap that spawns a monster and then fires darts at it to kill it over a set number of hits, which in turn can trigger another event. And so a MacGyver’d timer system was born.

Screenshot of split screen FPS shooter action in a game world

Above: If some level designs look “inspired by” classic FPS games like Wolfenstein 3D, DOOM, and Quake, that’s not by accident.

“It’s a big inspiration for us to keep working on it,” says Lee of the community’s commitment to working with the tools provided. That’s despite Lee admitting the game has not performed financially as well as they had hoped, and hasn’t realized some of the creative goals the team had in mind in the early development phase. “Our initial idea was much grander,” Lee says, “so that it would become a RPG, so people could create those games as well.”

This has fed into the desire to maintain support for the community and the project as it was initially envisaged. “We still get people in, and they say they love it, so we want to keep investing in it,” says Lee. And it’s not like any of these efforts will be wasted, as the underlying technologies are being crafted so that they can be used in other projects: “so it’s not like something we do now won’t ever be used again,” Lee adds.

Building for the future

Success can clearly be measured in multiple ways. Keeping the lights on—“maintaining sustainability”—can be a core goal, but it also presents opportunity. For Blowfish that emerges from the interns who have gravitated from entry positions to full-time roles. With over half the staff freshly out of game colleges, it shouldn’t be a stretch to consider their position a little precarious.

But that’s the point for Blowfish. Not only has this renegade group of executives been backed by straight-out-of-college enthusiasts, but the results have been… sustainable.

While the trajectory is clear to the founders, gamers still need to peel the shell that reveals the map that tells a story that matters. And it doesn’t matter whether it comes from an intern, a noob, or an executive; the only criterion that counts among iOS and general PC releases is that the best game wins.

Let the games begin.

Refocus. Rebuild. Remake Something New


The original article is published by Intel Game Dev on VentureBeat*: Refocus. Rebuild. Remake something new. Get more game dev news and related topics from Intel on VentureBeat.

Image - promo image, two lead characters in the foreground at the side of image, bots and spaceships battling behind them, center and background

You might think it’s pretty easy to start a new game development shop and craft a hit when your résumé reads as a who’s-who of game development royalty. That’s not the case, though, as even Jason Coleman, founder and president of Sparkypants, can attest.

His credits include working with legends such as Sid Meier at Microprose on Civilization* II, and also on Rise of Nations*, among others. After a spell contracting, Coleman revealed that many former colleagues from those days of high-level real-time strategy (RTS) game development wanted to reunite, given all the opportunities still available for crafting unique gaming experiences.

Their first game, Dropzone*, from the new studio, Sparkypants, pulls on some familiar elements of current MOBA games, but adds pieces that have defined RTS games of the past. What has changed for Coleman’s team is how the ability to rapidly iterate the game—almost daily, in fact—has led to consistent improvements.

“We can jam stuff in, rip stuff out, and it’s really driven our development process,” says Coleman. “It makes us think about games as gamers and developers, and also as spectators,” he adds.

This process also ensures that the original game design doc is essentially thrown out of the window. Initially codenamed “Sportsball”—with an aim to make it as competitive and accessible as sports—the core premise was to deconstruct the traditional RTS for the modern day.

“Vision for what we’re trying to do from the first day was simple: Can we reimagine an RTS for the modern day, and reconstruct it,” says Coleman.

It meant asking some fundamental questions: What are the biggest challenges we have as gamers? “Time,” states Coleman, “so we’re playing mobile games because that’s where we can find the time, and that’s where the 15-minute time limit came from.”

Limiting matches to this digestible format required a set of new considerations for the veteran crew. Chief among them was to still guarantee that the experience was really satisfying.

Screenshot of colorful player and creature engaged in battle with various lighting effects on a 3d arena.

Above: Action gets intense in the strategy game, Dropzone*.

“But to do that you have to strip out familiar features of an RTS, such as base-building,” says Coleman. That resulted in focusing on other key elements that will scratch the strategic itch for traditional RTS players, as well as provide a core entertaining backbone that would entice new players into the fold.

“All the changes came out of experimentation. We had a saying that if it takes longer to talk about than to implement it we’ll just go implement it,” Coleman reveals.

One significant change to the original design vision was prompted by the community. Eventually the team realized that, after all their discussions about its merits, they could just build it and find out whether it turned out to be fun. That’s how the ability to play three-versus-three came about, building on the first vision that the game should just be one-on-one.

“That absolutely came out of the community. We always said it was a stupid idea, and our hardcore one-vs-one players also thought it was a bad idea. But this was a perfect example of us talking about it so much and just dismissed it, and we really should have just made it at the start… It took us a morning to do a hacky version of it and it was immediately fun. But there is a skill curve to Dropzone that is steep at the beginning…so the three-v-three with one hero each lowers that drastically,” says Coleman of this significant change to their plan.

This addition also brought new players into the Dropzone fold as Coleman accepts that managing three heroes, understanding map control points, and remaining aware of your opponent can be “really intense from the beginning, and possibly also exhausting!”

Given the heavy competition in the MOBA genre, it was vital that Dropzone embrace its unique mechanics to become a standout product in the space. For the Sparkypants team that involved maintaining the intensity that’s so much a part of the RTS experience.

“The main resource to manage is the player’s attention span,” says Coleman of a core design philosophy for the genre. “The goal is to tune the game so that there’s always something to do, and always a little more than you can manage.

“Having three heroes to control made a huge difference, along with giving them strategic reasons to split up. And so map control in Dropzone is important, as is ensuring there is the right balance of elements to do around the map,” explains Coleman.

Though it wasn’t the initial intention to establish Dropzone as a prominent eSport feature, additions like the Spectator Mode and Ranked play for one-versus-one play should help it carve a place in this burgeoning market. Coleman accepts that competitive eSports play is often driven by a relatively small number of hardcore players, but that these players—through gameplay videos and streaming—drive a lot of aspirational aspects.

For Dropzone, the once-disputed and now popular three-versus-three mode also afforded newcomers the ability to learn the ropes they could then take into one-versus-one play as they watch other players online and figure out their techniques.

Coleman is also proud of the team’s studio-built game engine. Starting a new company and designing a new game can be a significant challenge in itself without the added pressure of dealing with core technical hurdles. In particular, the rapid load times let players change their heroes and skill loadouts, then quickly enter the sandbox mode and test new tactics. Understanding how the skills can work collectively is part of that skill component that has hooked the fan base throughout the game’s life in Early Access, and particularly now that it has gone free-to-play.

Though its roots may lie in the details of classic strategy gaming experiences, it maintains a fun spirit in its heroes, which include a dog and a brain!

“And we’ve been asked when we are doing a dolphin hero,” adds Coleman. A great deal of detail has been packed into a digestible 15-minute game format, and with the daily iterations and suggestions from the community driving efforts to improve the experience, expect Dropzone to be a serious factor for strategy gamers new and old.

Using Intel® MPI Library on Intel® Xeon Phi™ Product Family


Introduction

The Message Passing Interface (MPI) standard defines a message-passing library: a collection of routines used in distributed-memory parallel programming. This document is designed to help users get started writing code and running MPI applications using the Intel® MPI Library on a development platform that includes the Intel® Xeon Phi™ processor or coprocessor. The Intel MPI Library is a multi-fabric message passing library that implements the MPI-3.1 specification (see Table 1).

In this document, the Intel MPI Library 2017 and 2018 Beta for Linux* OS are used.

Table 1. Intel® MPI Library at a glance

Processors: Intel® processors, coprocessors, and compatibles

Languages: Natively supports C, C++, and Fortran development

Development Environments: Microsoft Visual Studio* (Windows*), Eclipse*/CDT* (Linux*)

Operating Systems: Linux and Windows

Interconnect Fabric Support:
  • Shared memory
  • RDMA-capable network fabrics through DAPL* (for example, InfiniBand*, Myrinet*)
  • Intel® Omni-Path Architecture
  • Sockets (for example, TCP/IP over Ethernet, Gigabit Ethernet*) and others

This document summarizes the steps to build and run an MPI application on an Intel® Xeon Phi™ processor x200, on an Intel® Xeon Phi™ coprocessor x200, and on an Intel® Xeon Phi™ coprocessor x100, natively or symmetrically. First, we introduce the Intel Xeon Phi processor x200 product family and Intel Xeon Phi processor x100 product family and the MPI programming models.

Intel® Xeon Phi™ Processor Architecture

Intel Xeon Phi processor x200 product family architecture: There are two versions of this product. The processor version is the host processor and the coprocessor version requires an Intel® Xeon® processor host. Both versions share the architecture below (see Figure 1):

  • Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
  • Up to 72 cores with 2D mesh architecture
  • Each core has two 512-bit vector processing units (VPUs) and four hardware threads
  • Each pair of cores (tile) shares 1 MB L2 cache
  • 8 or 16 GB high-bandwidth on package memory (MCDRAM)
  • 6 channels DDR4, up to 384 GB (available in the processor version only)
  • For the coprocessor, the third-generation PCIe* is connected to the host


Figure 1. Intel® Xeon Phi™ processor x200 architecture.

To enable the functionalities of the Intel Xeon Phi processor x200, you need to download and install the Intel Xeon Phi processor software available here.

The Intel Xeon Phi coprocessor x200 attaches to an Intel Xeon processor-based host via a third-generation PCIe interface. The coprocessor runs on a standard Linux OS. It can be used as an extension to the host (so the host can offload the workload) or as an independent compute node. The first step to bring an Intel Xeon Phi coprocessor x200 into service is to install the Intel® Manycore Platform Software Stack (Intel® MPSS) 4.x on the host, which is available here. The Intel MPSS is a collection of software including device drivers, coprocessor management utilities, and the Linux OS for the coprocessor.

Intel Xeon Phi coprocessor x100 architecture: the Intel Xeon Phi coprocessor x100 is the first-generation of the Intel Xeon Phi product family. The coprocessor attaches to an Intel Xeon processor-based host via a second-generation PCIe interface. It runs on an OS separate from the host and has the following architecture (see Figure 2):

  • Intel® Initial Many Core Instructions
  • Up to 61 cores with high-bandwidth, bidirectional ring interconnect architecture
  • Each core has a 512-bit wide VPU and four hardware threads
  • Each core has a private 512-KB L2 cache
  • 16 GB GDDR5 memory
  • The second-generation PCIe is connected to the host


Figure 2. Intel® Xeon Phi™ processor x100 architecture.

To bring the Intel Xeon Phi coprocessor x100 into service, you must install the Intel MPSS 3.x on the host, which can be downloaded here.

MPI Programming Models

The Intel MPI Library supports the following MPI programming models (see Figure 3):

  • Host-only model (Intel Xeon processor or Intel Xeon Phi processor): In this mode, all MPI ranks reside and execute the workload on the host CPU only (or Intel Xeon Phi processor only).
  • Offload model: In this mode, the MPI ranks reside solely on the Intel Xeon processor host. The MPI ranks use offload capabilities of the Intel® C/C++ Compiler or Intel® Fortran Compiler to offload some workloads to the coprocessors. Typically, one MPI rank is used per host, and the MPI rank offloads to the coprocessor(s).
  • Coprocessor-only model: In this native mode, the MPI ranks reside solely inside the coprocessor. The application can be launched from the coprocessor.
  • Symmetric model: In this mode, the MPI ranks reside on the host and the coprocessors. The application can be launched from the host.


Figure 3. MPI programming models.
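
To make the offload model listed above more concrete, here is a minimal sketch (not one of this document's official samples) that uses the pragma-based offload support of the Intel C/C++ Compiler; the array name, its length, and the choice of coprocessor 0 are illustrative assumptions:

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

#define N 1000000

int main(int argc, char **argv)
{
   int rank;
   float *a = (float *)malloc(N * sizeof(float));

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);

   // The host-side MPI rank offloads this loop to coprocessor 0; the data in
   // 'a' is copied to the coprocessor and the results are copied back.
   #pragma offload target(mic:0) inout(a : length(N))
   {
      int i;
      for (i = 0; i < N; i++)
         a[i] = i * 0.5f;
   }

   printf("rank %d: offloaded loop done, a[10] = %f\n", rank, a[10]);

   free(a);
   MPI_Finalize();
   return 0;
}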

Using the Intel® MPI Library

This section shows how to build and run an MPI application in the following configurations: on an Intel Xeon Phi processor x200, on a system with one or more Intel Xeon Phi coprocessor x200, and on a system with one or more Intel Xeon Phi coprocessor x100 (see Figure 4).


Figure 4. Different configurations: (a) standalone Intel® Xeon Phi™ processor x200, (b) Intel Xeon Phi coprocessor x200 connected to a system with an Intel® Xeon® processor, and (c) Intel® Xeon Phi™ coprocessor x100 connected to a system with an Intel Xeon processor.

Installing the Intel® MPI Library

The Intel MPI Library is packaged as a standalone product or as a part of the Intel® Parallel Studio XE Cluster Edition.

By default, the Intel MPI Library will be installed in the path /opt/intel/impi on the host or the Intel Xeon Phi processor. To start, follow the appropriate directions to install the latest versions of the Intel C/C++ Compiler and the Intel Fortran Compiler.

You can purchase or try the free 30-day evaluation of the Intel Parallel Studio XE from https://software.intel.com/en-us/intel-parallel-studio-xe. These instructions assume that you have the Intel MPI Library tar file - l_mpi_<version>.<package_num>.tgz. This is the latest stable release of the library at the time of writing this article. To check if a newer version exists, log into the Intel® Registration Center. The instructions below are valid for all current and subsequent releases.

As root user, untar the tar file l_mpi_<version>.<package_num>.tgz:

# tar -xzvf l_mpi_<version>.<package_num>.tgz
# cd l_mpi_<version>.<package_num>

Execute the install script on the host and follow the instructions. The installation will be placed in the default installation directory /opt/intel/impi/<version>.<package_num> assuming you are installing the library with root permission.

# ./install.sh

Compiling an MPI program

To compile an MPI program on the host or on an Intel Xeon Phi processor x200:

Before compiling an MPI program, you need to establish the proper environment settings for the compiler and for the Intel MPI Library:

$ source /opt/intel/compilers_and_libraries_<version>/linux/bin/compilervars.sh intel64
$ source /opt/intel/impi/<version>.<package_num>/bin64/mpivars.sh

or if you installed the Intel® Parallel Studio XE Cluster Edition, you can simply source the configuration script:

$ source /opt/intel/parallel_studio_xe_<version>/psxevars.sh intel64

Compile and link your MPI program using an appropriate compiler command:

To compile and link with the Intel MPI Library, use the appropriate commands from Table 2.

Table 2. MPI compilation Linux* command.

  • C: mpiicc
  • C++: mpiicpc
  • Fortran 77 / 95: mpiifort

For example, to compile the C program for the host, you can use the wrapper mpiicc:

$ mpiicc ./myprogram.c -o myprogram

To compile the program for the Intel Xeon Phi processor x200 and the Intel Xeon Phi coprocessor x200, add the flag -xMIC-AVX512 to take advantage of the Intel AVX-512 instruction set architecture (ISA) available on this architecture. For example, the following command compiles a C program for the Intel Xeon Phi product family x200 using the Intel AVX-512 ISA:

$ mpiicc -xMIC-AVX512 ./myprogram.c -o myprogram.knl

To compile the program for the Intel Xeon Phi coprocessor x100, add the flag -mmic. The following command shows how to compile a C program for the Intel Xeon Phi coprocessor x100:

$ mpiicc -mmic ./myprogram.c -o myprogram.knc

Running an MPI program on the Intel Xeon Phi processor x200

To run the application on the Intel Xeon Phi processor x200, use the script mpirun:

$ mpirun -n <# of processes> ./myprogram.knl

where n is the number of MPI processes to launch on the processor.

Running an MPI program on the Intel Xeon Phi coprocessor x200 and Intel Xeon Phi coprocessor x100

To run an application on the coprocessors, the following steps are needed:

  • Start the MPSS service if it was stopped previously:

    $ sudo systemctl start mpss

  • Transfer the MPI executable from the host to the coprocessor. For example, use the scp utility to transfer the executable (for the Intel Xeon Phi coprocessor x100) to the coprocessor named mic0:

    $ scp myprogram.knc mic0:~/myprogram.knc

  • Transfer the MPI libraries and compiler libraries to the coprocessors. Before the first run of an MPI application on the Intel Xeon Phi coprocessors, copy the appropriate MPI and compiler libraries to each coprocessor equipped on the system: for the coprocessor x200, the libraries found under the lib64 directory are transferred; for the coprocessor x100, the libraries found under the mic directory are transferred.

For example, we issue the copy to the first coprocessor x100, called mic0, which is accessible via the IP address 172.31.1.1. Note that each coprocessor has a unique IP address, since the coprocessors are treated as just other uniquely addressable machines; you can refer to the first coprocessor as mic0 or by its IP address.

# sudo scp /opt/intel/impi/2017.3.196/mic/bin/* mic0:/bin/
# sudo scp /opt/intel/impi/2017.3.196/mic/lib/* mic0:/lib64/
# sudo scp /opt/intel/composer_xe_2017.3.196/compiler/lib/mic/* mic0:/lib64/

Instead of copying the MPI and compiler libraries manually, you can also run the script shown below to transfer them to the two coprocessors mic0 and mic1:

#!/bin/sh

export COPROCESSORS="mic0 mic1"
export BINDIR="/opt/intel/impi/2017.3.196/mic/bin"
export LIBDIR="/opt/intel/impi/2017.3.196/mic/lib"
export COMPILERLIB="/opt/intel/compilers_and_libraries_2017/linux/lib/mic"

for coprocessor in `echo $COPROCESSORS`
do
   for prog in mpiexec mpiexec.hydra pmi_proxy mpirun
   do
      sudo scp $BINDIR/$prog $coprocessor:/bin/$prog
   done

   for lib in libmpi.so.12 libmpifort.so.12 libmpicxx.so.12
   do
      sudo scp $LIBDIR/$lib $coprocessor:/lib64/$lib
   done

   for lib in libimf.so libsvml.so libintlc.so.5
   do
      sudo scp $COMPILERLIB/$lib $coprocessor:/lib64/$lib
   done
done

Script used for transferring MPI libraries to two coprocessors.

Another approach is to NFS mount the coprocessors’ file system from the host so that the coprocessors can have access to their MPI libraries from there. One advantage of using NFS mounts is that it saves RAM space on the coprocessors. The details on how to set up NFS mounts can be found in the first example in this document.

To run the application natively on the coprocessor, log in to the coprocessor and then run the mpirun script:

$ ssh mic0
$ mpirun -n <# of processes> ./myprogram.knc

where n is the number of MPI processes to launch on the coprocessor.

Finally, to run an MPI program from the host (symmetrically), additional steps are needed:

Set the Intel MPI environment variable I_MPI_MIC to let the Intel MPI Library recognize the coprocessors:

$ export I_MPI_MIC=enable

Disable the firewall in the host:

$ systemctl status firewalld
$ sudo systemctl stop firewalld

For multi-card use, configure Intel MPSS peer-to-peer so that each card can ping others:

$ sudo /sbin/sysctl -w net.ipv4.ip_forward=1

If you want to get debug information, include the flags -verbose and -genv I_MPI_DEBUG=n when running the application.
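
For example (the debug level and rank counts here are arbitrary, illustrative choices), a symmetric run with debug output might look like:

$ mpirun -verbose -genv I_MPI_DEBUG=5 -host localhost -n 2 ./myprogram : -host mic0 -n 2 ~/myprogram.knc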

The following sections include sample MPI programs written in C. The first example shows how to compile and run a program for Intel Xeon Phi processor x200 and for Intel Xeon Phi coprocessor x200. The second example shows how to compile and run a program for Intel Xeon Phi coprocessor x100.

Example 1

For illustration purposes, this example shows how to build and run an Intel MPI application in symmetric mode on a host that connects to two Intel Xeon Phi coprocessors x200. Note that the driver Intel MPSS 4.x should be installed on the host to enable the Intel Xeon Phi coprocessor x200.

In this example, use the integral representation below to calculate Pi (π):

π = ∫₀¹ 4 / (1 + x²) dx

Appendix A includes the implementation program. The workload is divided among the MPI ranks. Each rank spawns a team of OpenMP* threads, and each thread works on a chunk of the workload to take advantage of vectorization. First, compile and run this application on the Intel Xeon processor host. Since this program uses OpenMP, you need to compile the program with OpenMP libraries. Note that the Intel Parallel Studio XE 2018 is used in this example.
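
Concretely, the program in Appendix A approximates this integral with a midpoint-rule sum over NUM_STEP subintervals; each rank accumulates the terms of its contiguous index range, and the partial sums are combined with MPI_Reduce:

dx  = 1 / NUM_STEP
x_i = (i + 0.5) / NUM_STEP
π  ≈ sum over i = 0 .. NUM_STEP-1 of 4 * dx / (1 + x_i^2)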

Set the environment variables, compile the application for the host, and then generate the optimization report on vectorization and OpenMP:

$ source /opt/intel/compilers_and_libraries_2018/linux/bin/compilervars.sh intel64
$ mpiicc mpitest.c -qopenmp -O3 -qopt-report=5 -qopt-report-phase:vec,openmp -o mpitest

To run two ranks on the host:

$ mpirun -host localhost -n 2 ./mpitest
Hello world: rank 0 of 2 running on knl-lb0.jf.intel.com
Hello world: rank 1 of 2 running on knl-lb0.jf.intel.com
FROM RANK 1 - numthreads = 32
FROM RANK 0 - numthreads = 32

Elapsed time from rank 0:    8246.90 (usec)
Elapsed time from rank 1:    8423.09 (usec)
rank 0 pi=   3.141613006592

Next, compile the application for the Intel Xeon Phi coprocessor x200 and transfer the executable to the coprocessors mic0 and mic1 (assuming you have already set up passwordless SSH access to the coprocessors).

$ mpiicc mpitest.c -qopenmp -O3 -qopt-report=5 -qopt-report-phase:vec,openmp -xMIC-AVX512 -o mpitest.knl
$ scp mpitest.knl mic0:~/.
$ scp mpitest.knl mic1:~/.

Enable MPI for the coprocessors and disable the firewall in the host:

$ export I_MPI_MIC=enable
$ sudo systemctl stop firewalld

This example also shows how to mount a shared directory using the Network File System (NFS). As root, you mount the /opt/intel directory where the Intel C++ Compiler and the Intel MPI Library are installed. First, add descriptors in the /etc/exports configuration file on the host to share the directory /opt/intel with the coprocessors, whose IP addresses are 172.31.1.1 and 172.31.2.1, with read-only (ro) privilege.

[host~]# cat /etc/exports
/opt/intel 172.31.1.1(ro,async,no_root_squash)
/opt/intel 172.31.2.1(ro,async,no_root_squash)

Update the NFS export table and restart the NFS server in the host:

[host~]# exportfs -a
[host~]# service nfs restart

Next, log in on the coprocessors and create the mount point /opt/intel:

[host~]# ssh mic0
mic0:~# mkdir /opt
mic0:~# mkdir /opt/intel

 

Insert the descriptor “172.31.1.254:/opt/intel /opt/intel nfs defaults 1 1” into the /etc/fstab file on mic0:

mic0:~# cat /etc/fstab
/dev/root            /                    auto       defaults              1  1
proc                 /proc                proc       defaults              0  0
devpts               /dev/pts             devpts     mode=0620,gid=5       0  0
tmpfs                /run                 tmpfs      mode=0755,nodev,nosuid,strictatime 0  0
tmpfs                /var/volatile        tmpfs      defaults,size=85%     0  0
172.31.1.254:/opt/intel /opt/intel nfs defaults                            1  1

Finally, mount the shared directory /opt/intel on the coprocessor:

mic0:~# mount -a

Repeat this procedure for mic1 with this descriptor “172.31.2.254:/opt/intel /opt/intel nfs defaults 1 1” added to the /etc/fstab file in mic1.

Make sure that mic0 and mic1 are included in the /etc/hosts file:

$ cat /etc/hosts
127.0.0.1       localhost
::1             localhost
172.31.1.1      mic0
172.31.2.1      mic1

Now run the application from the host, with one rank on the host and one rank on each coprocessor:

$ mpirun -host localhost -n 1 ./mpitest : -host mic0 -n 1 ~/mpitest.knl : -host mic1 -n 1 ~/mpitest.knl
Hello world: rank 0 of 3 running on knl-lb0
Hello world: rank 1 of 3 running on mic0
Hello world: rank 2 of 3 running on mic1
FROM RANK 0 - numthreads = 64
FROM RANK 2 - numthreads = 272
FROM RANK 1 - numthreads = 272
Elapsed time from rank 0:   12114.05 (usec)
Elapsed time from rank 1:  136089.09 (usec)
Elapsed time from rank 2:  125049.11 (usec)
rank 0 pi=   3.141597270966

By default, the maximum number of hardware threads available on each compute node is used. However, you can change this default behavior by adding the local -env option for that compute node. For example, to set the number of OpenMP threads on mic0 to 68 and use compact affinity, you can use the command:

$ mpirun -host localhost -n 1 ./mpitest : -host mic0 -n 1 -env OMP_NUM_THREADS=68 -env KMP_AFFINITY=compact ~/mpitest.knl : -host mic1 -n 1 ~/mpitest.knl
Hello world: rank 0 of 3 running on knl-lb0.jf.intel.com
Hello world: rank 1 of 3 running on mic0
Hello world: rank 2 of 3 running on mic1
FROM RANK 0 - numthreads = 64
FROM RANK 1 - numthreads = 68
FROM RANK 2 - numthreads = 272
Elapsed time from rank 0:   11068.11 (usec)
Elapsed time from rank 1:   57780.98 (usec)
Elapsed time from rank 2:  133417.13 (usec)
rank 0 pi=   3.141597270966

To simplify the launch process, define a file with all machine names, give all the executables the same name, and place them in a predefined directory. For example, here all executables are named mpitest and are located in the user home directories:

$ cat hosts_file
knl-lb0:1
mic0:2
mic1:2

$ mpirun -machinefile hosts_file -n 5 ~/mpitest
Hello world: rank 0 of 5 running on knl-lb0
Hello world: rank 1 of 5 running on mic0
Hello world: rank 2 of 5 running on mic0
Hello world: rank 3 of 5 running on mic1
Hello world: rank 4 of 5 running on mic1
FROM RANK 0 - numthreads = 64
FROM RANK 1 - numthreads = 136
FROM RANK 3 - numthreads = 136
FROM RANK 2 - numthreads = 136
FROM RANK 4 - numthreads = 136
Elapsed time from rank 0:   11260.03 (usec)
Elapsed time from rank 1:   71480.04 (usec)
Elapsed time from rank 2:   69352.15 (usec)
Elapsed time from rank 3:   74187.99 (usec)
Elapsed time from rank 4:   67718.98 (usec)
rank 0 pi=   3.141598224640

 

Example 2

Example 2 shows how to build and run an MPI application in the symmetric model on a host that connects to two Intel Xeon Phi coprocessors x100. Note that the driver Intel MPSS 3.x should be installed on the host for the Intel Xeon Phi coprocessor x100.

The sample program estimates the calculation of Pi (π) using a Monte Carlo method. Consider a sphere centered at the origin and circumscribed by a cube. The sphere’s radius is r and the cube edge length is 2r. The volumes of a sphere and a cube are given by

V_sphere = (4/3) π r³        V_cube = (2r)³ = 8 r³

The first octant of the coordinate system contains one eighth of the volumes of both the sphere and the cube; the volumes in that octant are given by:

V_sphere / 8 = (π/6) r³        V_cube / 8 = r³

If we generate Nc points uniformly and randomly in the cube within this octant, we expect that about Ns points will be inside the sphere’s volume according to the following ratio:

Ns / Nc ≈ (V_sphere / 8) / (V_cube / 8) = ((π/6) r³) / r³ = π / 6

Therefore, the estimated Pi (π) is calculated by

π ≈ 6 Ns / Nc

where Nc is the number of points generated in the portion of the cube residing in the first octant, and Ns is the total number of points found inside the portion of the sphere residing in the first octant.

In the implementation, rank 0 (process) is responsible for dividing the work among the other n ranks. Each rank is assigned a chunk of work, and the summation is used to estimate the number Pi. Rank 0 divides the x-axis into n equal segments. Each rank generates (Nc /n) points in the assigned segment, and then computes the number of points in the first octant of the sphere (see Figure 5).


Figure 5. Each MPI rank handles a different portion in the first octant.

The pseudo code is shown below:

Rank 0 generates n random seeds
Rank 0 broadcasts the seeds to all n ranks
For each rank i in [0, n-1]:
    receive the corresponding seed
    set num_inside = 0
    for j = 0 to Nc/n:
        generate a point with coordinates
            x in [i/n, (i+1)/n]
            y in [0, 1]
            z in [0, 1]
        compute the distance d = x^2 + y^2 + z^2
        if d <= 1, increment num_inside
    send num_inside back to rank 0
Rank 0 sets Ns to the sum of all num_inside values
Rank 0 computes Pi = 6 * Ns / Nc

In order to build the application montecarlo.knc for the Intel Xeon Phi coprocessor x100, the Intel C++ Compiler 2017 is used. Appendix B includes the implementation program. Note that this example simply shows how to run the code on an Intel Xeon Phi coprocessor x100; you can optimize the sample code for further improvement.

$ source /opt/intel/compilers_and_libraries_2017/linux/bin/compilervars.sh intel64
$ mpiicc -mmic montecarlo.c -o montecarlo.knc

Build the application for the host:

$ mpiicc montecarlo.c -o montecarlo

Transfer the application montecarlo.knc to the /tmp directory on the coprocessors using the scp utility. In this example, we issue the copy to two Intel Xeon Phi coprocessors x100.

$ scp ./montecarlo.knc mic0:/tmp/montecarlo.knc
montecarlo.knc     100% 17KB 16.9KB/s 00:00
$ scp ./montecarlo.knc mic1:/tmp/montecarlo.knc
montecarlo.knc     100% 17KB 16.9KB/s 00:00

Transfer the MPI libraries and compiler libraries to the coprocessors using the script shown earlier. Enable MPI communication between the host and the Intel Xeon Phi coprocessors x100:

$ export I_MPI_MIC=enable

Run the mpirun script to start the application. The flag -n specifies the number of MPI processes and the flag -host specifies the machine name:

$ mpirun -n <# of processes> -host <hostname> <application>

We can run the application on multiple hosts by separating them with “:”. The first MPI rank (rank 0) always starts on the first part of the command:

$ mpirun -n <# of processes> -host <hostname1> <application> : -n <# of processes> -host <hostname2> <application>

This starts rank 0 on hostname1 and the other ranks on hostname2.

Now run the application on the host. The mpirun command shown below starts the application with 2 ranks on the host, 3 ranks on the coprocessor mic0, and 5 ranks on coprocessor mic1:

$ mpirun -n 2 -host localhost ./montecarlo : -n 3 -host mic0 /tmp/montecarlo.knc \
: -n 5 -host mic1 /tmp/montecarlo.knc

Hello world: rank 0 of 10 running on knc0
Hello world: rank 1 of 10 running on knc0
Hello world: rank 2 of 10 running on knc0-mic0
Hello world: rank 3 of 10 running on knc0-mic0
Hello world: rank 4 of 10 running on knc0-mic0
Hello world: rank 5 of 10 running on knc0-mic1
Hello world: rank 6 of 10 running on knc0-mic1
Hello world: rank 7 of 10 running on knc0-mic1
Hello world: rank 8 of 10 running on knc0-mic1
Hello world: rank 9 of 10 running on knc0-mic1
Elapsed time from rank 0:      13.87 (sec)
Elapsed time from rank 1:      14.01 (sec)
Elapsed time from rank 2:     195.16 (sec)
Elapsed time from rank 3:     195.17 (sec)
Elapsed time from rank 4:     195.39 (sec)
Elapsed time from rank 5:     195.07 (sec)
Elapsed time from rank 6:     194.98 (sec)
Elapsed time from rank 7:     223.32 (sec)
Elapsed time from rank 8:     194.22 (sec)
Elapsed time from rank 9:     193.70 (sec)
Out of 4294967295 points, there are 2248849344 points inside the sphere => pi=  3.141606330872

A shorthand way of doing this in symmetric mode is to use the -machinefile option for the mpirun command in coordination with the I_MPI_MIC_POSTFIX environment variable. In this case, make sure all executables are in the same location on the host and on the mic0 and mic1 cards.

The I_MPI_MIC_POSTFIX environment variable simply tells the library to add the .knc postfix when running on the cards (since the executables there are called montecarlo.knc).

$ export I_MPI_MIC_POSTFIX=.knc

Now set the rank mapping in your hosts file (by using the <host>:<#_ranks> format):

$ cat hosts_file
localhost:2
mic0:3
mic1:5

And run your executable:

$ mpirun -machinefile hosts_file /tmp/montecarlo

The nice thing about this syntax is that you only have to edit the hosts_file when deciding to change your number of ranks or need to add more cards.

As an alternative, you can ssh to a coprocessor and launch the application from there:

$ ssh mic0
$ mpirun -n 3 /tmp/montecarlo.knc
Hello world: rank 0 of 3 running on knc0-mic0
Hello world: rank 1 of 3 running on knc0-mic0
Hello world: rank 2 of 3 running on knc0-mic0
Elapsed time from rank 0:     650.47 (sec)
Elapsed time from rank 1:     650.61 (sec)
Elapsed time from rank 2:     648.01 (sec)
Out of 4294967295 points, there are 2248795855 points inside the sphere => pi=  3.141531467438

 

Summary

This document showed you how to compile and run simple MPI applications using the symmetric model. In a heterogeneous computing system, the performance of each computational unit is different, and this behavior leads to load imbalance. The Intel® Trace Analyzer and Collector can be used to analyze and understand the behavior of a complex MPI program running on a heterogeneous system. Using the Intel Trace Analyzer and Collector, you can quickly identify bottlenecks, evaluate load balancing, analyze performance, and identify communication hotspots. This powerful tool is essential for debugging and improving the performance of an MPI program running on a cluster with multiple computational units. For more details on using the Intel Trace Analyzer and Collector, read the whitepaper “Understanding MPI Load Imbalance with Intel® Trace Analyzer and Collector” available on /mic-developer. For more details, tips and tricks, and known workarounds, visit our Intel® Cluster Tools and the Intel® Xeon Phi™ Coprocessors page.

References

Appendix A

The code of the first sample program is shown below.

/*
 *  Copyright (c) 2017 Intel Corporation. All Rights Reserved.
 *
 *  Portions of the source code contained or described herein and all documents related
 *  to portions of the source code ("Material") are owned by Intel Corporation or its
 *  suppliers or licensors.  Title to the Material remains with Intel
 *  Corporation or its suppliers and licensors.  The Material contains trade
 *  secrets and proprietary and confidential information of Intel or its
 *  suppliers and licensors.  The Material is protected by worldwide copyright
 *  and trade secret laws and treaty provisions.  No part of the Material may
 *  be used, copied, reproduced, modified, published, uploaded, posted,
 *  transmitted, distributed, or disclosed in any way without Intel's prior
 *  express written permission.
 *
 *  No license under any patent, copyright, trade secret or other intellectual
 *  property right is granted to or conferred upon you by disclosure or
 *  delivery of the Materials, either expressly, by implication, inducement,
 *  estoppel or otherwise. Any license under such intellectual property rights
 *  must be express and approved by Intel in writing.
 */
//******************************************************************************
// Content: (version 1.0)
//      Calculate the number PI using its integral representation.
//
//******************************************************************************
#include <stdio.h>
#include <omp.h>   // needed for omp_get_num_threads()
#include "mpi.h"

#define MASTER 0
#define TAG_HELLO 1
#define TAG_TIME 2

const long ITER = 1024 * 1024;
const long SCALE = 16;
const long NUM_STEP = ITER * SCALE;

float calculate_partialPI(int n, int num) {
   unsigned long i;
   int  numthreads;
   float x, dx, pi = 0.0f;

   #pragma omp parallel
   #pragma omp master
   {
      numthreads = omp_get_num_threads();
      printf("FROM RANK %d - numthreads = %d\n", n, numthreads);
   }

   dx = 1.0 / NUM_STEP;

   unsigned long NUM_STEP1 = NUM_STEP / num;
   unsigned long begin = n * NUM_STEP1;
   unsigned long end = (n + 1) * NUM_STEP1;
   #pragma omp parallel for reduction(+:pi)
   for (i = begin; i < end; i++)
   {
      x = (i + 0.5f) / NUM_STEP;
      pi += (4.0f * dx) / (1.0f + x*x);
   }

   return pi;
}

int main(int argc, char **argv)
{
   float pi1, total_pi;
   double startprocess;
   int i, id, remote_id, num_procs, namelen;
   char name[MPI_MAX_PROCESSOR_NAME];
   MPI_Status stat;

   if (MPI_Init (&argc, &argv) != MPI_SUCCESS)
   {
      printf ("Failed to initialize MPI\n");
      return (-1);
   }

   // Create the communicator, and retrieve the number of processes.
   MPI_Comm_size (MPI_COMM_WORLD, &num_procs);

   // Determine the rank of the process.
   MPI_Comm_rank (MPI_COMM_WORLD, &id);

   // Get machine name
   MPI_Get_processor_name (name, &namelen);

   if (id == MASTER)
   {
      printf ("Hello world: rank %d of %d running on %s\n", id, num_procs, name);

      for (i = 1; i<num_procs; i++)
      {
         MPI_Recv (&remote_id, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);
         MPI_Recv (&num_procs, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);
         MPI_Recv (&namelen, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);
         MPI_Recv (name, namelen+1, MPI_CHAR, i, TAG_HELLO, MPI_COMM_WORLD, &stat);

         printf ("Hello world: rank %d of %d running on %s\n", remote_id, num_procs, name);
      }
   }
   else
   {
      MPI_Send (&id, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (&num_procs, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (&namelen, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (name, namelen+1, MPI_CHAR, MASTER, TAG_HELLO, MPI_COMM_WORLD);
   }

   startprocess = MPI_Wtime();

   pi1 = calculate_partialPI(id, num_procs);

   double elapsed = MPI_Wtime() - startprocess;

   MPI_Reduce (&pi1, &total_pi, 1, MPI_FLOAT, MPI_SUM, MASTER, MPI_COMM_WORLD);
   if (id == MASTER)
   {
      double timeprocess[num_procs];

      timeprocess[MASTER] = elapsed;
      printf("Elapsed time from rank %d: %10.2f (usec)\n", MASTER, 1000000 * timeprocess[MASTER]);

      for (i = 1; i < num_procs; i++)
      {
         // Rank 0 waits for elapsed time value
         MPI_Recv (&timeprocess[i], 1, MPI_DOUBLE, i, TAG_TIME, MPI_COMM_WORLD, &stat);
         printf("Elapsed time from rank %d: %10.2f (usec)\n", i, 1000000 *timeprocess[i]);
      }

      printf("rank %d pi= %16.12f\n", id, total_pi);
   }
   else
   {
      // Send back the processing time (in second)
      MPI_Send (&elapsed, 1, MPI_DOUBLE, MASTER, TAG_TIME, MPI_COMM_WORLD);
   }

   // Terminate MPI.
   MPI_Finalize();
   return 0;
}

 

Appendix B

The code of the second sample program is shown below.

/*
 *  Copyright (c) 2017 Intel Corporation. All Rights Reserved.
 *
 *  Portions of the source code contained or described herein and all documents related
 *  to portions of the source code ("Material") are owned by Intel Corporation or its
 *  suppliers or licensors.  Title to the Material remains with Intel
 *  Corporation or its suppliers and licensors.  The Material contains trade
 *  secrets and proprietary and confidential information of Intel or its
 *  suppliers and licensors.  The Material is protected by worldwide copyright
 *  and trade secret laws and treaty provisions.  No part of the Material may
 *  be used, copied, reproduced, modified, published, uploaded, posted,
 *  transmitted, distributed, or disclosed in any way without Intel's prior
 *  express written permission.
 *
 *  No license under any patent, copyright, trade secret or other intellectual
 *  property right is granted to or conferred upon you by disclosure or
 *  delivery of the Materials, either expressly, by implication, inducement,
 *  estoppel or otherwise. Any license under such intellectual property rights
 *  must be express and approved by Intel in writing.
 */
//******************************************************************************
// Content: (version 0.5)
//      Based on a Monte Carlo method, this MPI sample code uses volumes to
//      estimate the number PI.
//
//******************************************************************************
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <time.h>

#include "mpi.h"

#define MASTER 0
#define TAG_HELLO 4
#define TAG_TEST 5
#define TAG_TIME 6

int main(int argc, char *argv[])
{
  int i, id, remote_id, num_procs;

  MPI_Status stat;
  int namelen;
  char name[MPI_MAX_PROCESSOR_NAME];

  // Start MPI.
  if (MPI_Init (&argc, &argv) != MPI_SUCCESS)
    {
      printf ("Failed to initialize MPI\n");
      return (-1);
    }

  // Create the communicator, and retrieve the number of processes.
  MPI_Comm_size (MPI_COMM_WORLD, &num_procs);

  // Determine the rank of the process.
  MPI_Comm_rank (MPI_COMM_WORLD, &id);
    // Get machine name
  MPI_Get_processor_name (name, &namelen);

  if (id == MASTER)
    {
      printf ("Hello world: rank %d of %d running on %s\n", id, num_procs, name);

      for (i = 1; i<num_procs; i++)
	{
	  MPI_Recv (&remote_id, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);
	  MPI_Recv (&num_procs, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);
	  MPI_Recv (&namelen, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);
	  MPI_Recv (name, namelen+1, MPI_CHAR, i, TAG_HELLO, MPI_COMM_WORLD, &stat);

	  printf ("Hello world: rank %d of %d running on %s\n", remote_id, num_procs, name);
	}
    }
  else
    {
      MPI_Send (&id, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (&num_procs, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (&namelen, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (name, namelen+1, MPI_CHAR, MASTER, TAG_HELLO, MPI_COMM_WORLD);
    }

  // Rank 0 distributes seeds randomly to all processes.
  double startprocess, endprocess;

  int distributed_seed = 0;
  int *buff;

  buff = (int *)malloc(num_procs * sizeof(int));

  unsigned int MAX_NUM_POINTS = pow (2,32) - 1;
  unsigned int num_local_points = MAX_NUM_POINTS / num_procs;

  if (id == MASTER)
    {
      srand (time(NULL));

      for (i=0; i<num_procs; i++)
	{
	  distributed_seed = rand();
	  buff[i] = distributed_seed;
	}
    }

  // Broadcast the seed to all processes
  MPI_Bcast(buff, num_procs, MPI_INT, MASTER, MPI_COMM_WORLD);

  // At this point, every process (including rank 0) has a different seed. Using its seed,
  // each process generates num_local_points points randomly, with x in its assigned
  // segment [id/n, (id+1)/n] and y, z in [0, 1].
  startprocess = MPI_Wtime();

  srand (buff[id]);

  unsigned int point = 0;
  unsigned int rand_MAX = 128000;
  float p_x, p_y, p_z;
  float temp, temp2, pi;
  double result;
  unsigned int inside = 0, total_inside = 0;
    for (point=0; point<num_local_points; point++)
    {
      temp = (rand() % (rand_MAX+1));
      p_x = temp / rand_MAX;
      p_x = p_x / num_procs;

      temp2 = (float)id / num_procs;	// id belongs to 0, num_procs-1
      p_x += temp2;

      temp = (rand() % (rand_MAX+1));
      p_y = temp / rand_MAX;

      temp = (rand() % (rand_MAX+1));
      p_z = temp / rand_MAX;

      // Compute the number of points residing inside of the 1/8 of the sphere
      result = p_x * p_x + p_y * p_y + p_z * p_z;

      if (result <= 1)
	  {
		inside++;
	  }
    }

  double elapsed = MPI_Wtime() - startprocess;

  MPI_Reduce (&inside, &total_inside, 1, MPI_UNSIGNED, MPI_SUM, MASTER, MPI_COMM_WORLD);

#if DEBUG
  printf ("rank %d counts %u points inside the sphere\n", id, inside);
#endif

  if (id == MASTER)
    {
      double timeprocess[num_procs];

      timeprocess[MASTER] = elapsed;
      printf("Elapsed time from rank %d: %10.2f (sec) \n", MASTER, timeprocess[MASTER]);

      for (i=1; i<num_procs; i++)
	{
	  // Rank 0 waits for elapsed time value
	  MPI_Recv (&timeprocess[i], 1, MPI_DOUBLE, i, TAG_TIME, MPI_COMM_WORLD, &stat);
	  printf("Elapsed time from rank %d: %10.2f (sec) \n", i, timeprocess[i]);
	}

      temp = 6 * (float)total_inside;
      pi = temp / MAX_NUM_POINTS;
      printf ( "Out of %u points, there are %u points inside the sphere => pi=%16.12f\n", MAX_NUM_POINTS, total_inside, pi);
    }
  else
    {
      // Send back the processing time (in second)
      MPI_Send (&elapsed, 1, MPI_DOUBLE, MASTER, TAG_TIME, MPI_COMM_WORLD);
    }

  free(buff);

  // Terminate MPI.
  MPI_Finalize();

  return 0;
}

Intel® Manycore Platform Software Stack Archive for the Intel® Xeon Phi™ Coprocessor x200 Product Family


On this page you will find the past releases of the Intel® Manycore Platform Software Stack (Intel® MPSS) for the Intel® Xeon Phi™ coprocessor x200 product family. The most recent release is found here: https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-for-intel-xeon-phi-coprocessor-x200. We recommend customers use the latest release wherever possible.

  • N-1 release for Intel® MPSS 4.4.x

Intel MPSS 4.4.0 HotFix 1 release for Linux*

Intel MPSS 4.4.0 Hotfix 1 (released: May 8, 2017), downloads available:

  • RHEL 7.3: 214MB, MD5 checksum 8a015c38379b8be42c8045d3ceb44545
  • RHEL 7.2: 214MB, MD5 checksum 694b7b908c12061543d2982695985d8b
  • SLES 12.2: 213MB, MD5 checksum 506ab12af774f78fa8e107fd7a4f96fd
  • SLES 12.1: 213MB, MD5 checksum b8520888954e846e8ac8604d62a9ba96
  • SLES 12.0: 213MB, MD5 checksum 88a3a4415afae1238453ced7a0df28ea
  • Card installer file (mpss-4.4.0-card.tar): 761MB, MD5 checksum d26e26868297cea5fd4ffafe8d78b66e
  • Source file (mpss-4.4.0-card-source.tar): 514MB, MD5 checksum 127713d06496090821b5bb3613c95b30

Documentation:

  • releaseNotes-linux.txt: Release notes (English), last updated May 2017, approx. 15KB
  • readme.txt: Readme (includes installation instructions) for Linux (English), last updated May 2017, approx. 17KB
  • mpss_user_guide.pdf: Intel MPSS user guide, last updated May 2017, approx. 3MB
  • eula.txt: End User License Agreement (Important: Read before downloading, installing, or using), last updated May 2017, approx. 33KB

 

Intel MPSS 4.4.0 HotFix 1 release for Windows*

Intel MPSS 4.4.0 Hotfix 1 (released: May 8, 2017), download available:

  • mpss-4.4.0-windows.zip: 1091MB, MD5 checksum 204a65b36858842f472a37c77129eb53

Documentation:

  • releasenotes-windows.txt: Release notes (English), last updated May 2017, approx. 7KB
  • readme-windows.pdf: Readme for Windows (English), last updated May 2017, approx. 399KB
  • mpss_users_guide_windows: Intel MPSS user guide for Windows, last updated May 2017, approx. 3MB
  • eula.txt: End User License Agreement (Important: Read before downloading, installing, or using), last updated May 2017, approx. 33KB

 

The discussion forum at http://software.intel.com/en-us/forums/intel-many-integrated-core is available to join and discuss any enhancements or issues with the Intel MPSS.

Thread pool behavior for Apollo Lake Intel® SDK for OpenCL™ applications


This article describes internal driver optimizations for developers using Intel Atom™ processors, Celeron™ processors, and Pentium™ processors based on the "Apollo Lake" platform (Broxton Graphics). The intent is to clarify existing documentation. The optimizations described are completely transparent; the only change needed from a developer's perspective is awareness that, for this special case, applications should be designed for the thread pool configuration instead of the underlying hardware.

Driver thread pool optimizations maximize Apollo Lake EU performance 

For  Intel® Core™ and Intel® Xeon® processors with integrated graphics, the number of EUs and EUs per subslice is large enough that mapping thread pools directly to subslices is efficient.  Tying thread pool implementation to hardware means that application behavior and hardware details can be described together in a way that is easy to visualize and remember.  This approach was used by many reference documents such as The Compute Architecture of Intel Processor Graphics Gen9.  

However, for the relatively smaller GPUs in the embedded processors listed above, this approach could sometimes result in non-optimal mapping. For these processors, EUs are now pooled across subslices, creating "virtual subslices" that do not match the hardware. In this case it helps to understand where behavior is driven by thread pools instead of the hardware layout.

Thread pools and physical subslices for HD graphics 505

There are two GPU configurations for Apollo Lake:

  • Intel® HD Graphics 505: 18 EUs, 3x6 physical, now using 2x9 thread pools (shown above)
  • Intel® HD Graphics 500: 12 EUs, 2x6 physical, now using 1x12 thread pools (not shown)

The thread pools, not the physical hardware, determine how you should write your application. For example, if you have HD Graphics 505, your application should be written as if there were two subslices with nine EUs each, not three subslices with six EUs. The short host-code sketch below shows one way to read these values at run time.
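
A minimal host-code sketch (standard OpenCL 1.2 API calls; error checking omitted for brevity) that reads these values from the driver instead of hard-coding the physical layout:

#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
   cl_platform_id platform;
   cl_device_id device;
   cl_uint compute_units;
   size_t max_wg_size;
   cl_ulong local_mem_size;

   // Pick the first platform and its first GPU device.
   clGetPlatformIDs(1, &platform, NULL);
   clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

   // These values reflect the configuration exposed by the driver.
   clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                   sizeof(compute_units), &compute_units, NULL);
   clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                   sizeof(max_wg_size), &max_wg_size, NULL);
   clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                   sizeof(local_mem_size), &local_mem_size, NULL);

   printf("compute units: %u, max work-group size: %zu, local memory: %llu bytes\n",
          compute_units, max_wg_size, (unsigned long long)local_mem_size);
   return 0;
}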

Extensive testing showed that, in the worst case, performance simply matched the legacy configuration, while the boost from switching to 2x9/1x12 often approaches 2X. Since no scenarios were found that benefit from the legacy configuration, there are no plans to add extensions to modify MEDIA_POOL_STATE.

 

Thread pool size vs. physical hardware configuration

There are 4 main areas to consider:

  • Optimal work group size is determined by thread pool configuration, not physical hardware.  The driver will automatically handle the thread launch details to maximize thread occupancy.  State tracking (such as branch masking) is handled at the pool and subslice level by the driver, but for the most part these are implementation details which can be ignored by applications.
  • Local memory is shared by threads in the same pool.  The  number of bytes reported by CL_DEVICE_LOCAL_MEM_SIZE is physically located in lowest level cache, not the subslice.  For Apollo Lake this means either 1 (for 1x12 HD Graphics 500) or 2 (for 2x9 HD Graphics 505) regions are reserved to be shared by all threads in the same workgroup.  

  • Workgroup Barriers: again, behavior is tied to the work group, which is defined by the thread pool. There are now two types of internal barrier implementations -- "local barriers" within a physical subslice and "linked barriers" spanning subslices.  This behavior happens automatically and cannot be changed by the application.  There are no additional knobs provided to optimize.  
  • Subgroup extensions:  subgroups are "between" work groups and work items, so their mapping to hardware remains unchanged.   Work items in a subgroup execute on the same EU thread.  For more info see Ben Ashbaugh's excellent section on subgroups in our extension tutorial.

 

Conclusion

In the past, thread pools were always configured to match the physical hardware. Now there is a notable exception, due to optimizations that increase performance on Apollo Lake processors. You won't need to make many changes to use these optimizations; the most important takeaway is that Intel has done the work behind the scenes to make efficient use of Apollo Lake capabilities easy. The details in this article are provided as conceptual background, but everything happens under the hood and the changes are completely transparent. To your application, HD Graphics 500 has 1 subslice with 12 EUs and HD Graphics 505 has 2 subslices with 9 EUs -- even though the underlying hardware is 2x6 and 3x6. Extensive internal testing has shown that this internal driver optimization provides significant improvements, and we have not yet seen a case of performance regression. However, we are always open to feedback: if you find a scenario where the legacy thread pool configuration may be a better fit, please let us know.

For more information, please see the Broxton Graphics Programmer's Reference Manual.


An Example of a Convolutional Neural Network for Image Super-Resolution


Convolutional neural networks (CNN) are becoming mainstream in computer vision. In particular, CNNs are widely used for high-level vision tasks, like image classification (AlexNet*, for example). This article (and associated tutorial) describes an example of a CNN for image super-resolution (SR), which is a low-level vision task, and its implementation using the Intel® Distribution for Caffe* framework and Intel® Distribution for Python*. This CNN is based on the work described in [1] and [2], which proposes a new approach to performing single-image SR using CNNs.

Introduction

Some modern camera sensors, present in everyday electronic devices like digital cameras, phones, and tablets, are able to produce reasonably high-resolution (HR) images and videos. The resolution in the images and videos produced by these devices is in many cases acceptable for general use.

However, there are situations where the image or video is considered low resolution (LR). Examples include the following situations:

  1. Device does not produce HR images or video (as in some surveillance systems).
  2. The objects of interest in the image or video are small compared to the size of the image or video frame; for example, faces of people or vehicle plates located far away from the camera.
  3. Blurred or noisy images.
  4. Application using the images or videos demands higher resolution than that present in the camera.
  5. Improving the resolution as a pre-processing step improves the performance of other algorithms that use the images; face detection, for example.

Super-resolution is a technique to obtain an HR image from one or several LR images. SR can be based on a single image or on several frames in a video sequence.

Single-image (or single-frame) SR uses pairs of LR and HR images to learn the mapping between them. For this purpose, image databases containing LR and HR pairs are created [3] and used as a training set. The learned mapping can be used to predict HR details in a new image.

On the other hand, multiple-frame SR is based on several images taken from the same scene, but from slightly different conditions (such as angle, illumination, and position). This technique uses the non-redundant information present in multiple images (or frames in an image sequence) to increase the SR performance.

In this article, we will focus on a single-image SR method.

Single-Image Super-Resolution Using Convolutional Neural Networks

In this method, a training set is used to train a neural network (NN) to learn the mapping between the LR and HR images in the training set. There are many references in the literature about SR; many different techniques have been proposed and used for about 30 years, and methods using deep CNNs have been developed in the last few years. One of the first such methods was created by the authors of [1], who described a three-layer CNN and named it Super-Resolution Convolutional Neural Network (SRCNN). Their pioneering work in this area is important because, besides demonstrating that the mapping from LR to HR can be cast as a CNN, they created a model often used as a reference; new methods compare their performance to the SRCNN results. The same authors have recently developed a modified version of their original SRCNN, which they named Fast Super-Resolution Convolutional Neural Network (FSRCNN), that offers better restoration quality and runs faster [2].

In this article, we describe both the SRCNN and the FSRCNN, and, in a separate tutorial, we show an implementation of the improved FSRCNN. Both the SRCNN and the FSRCNN can be used as a basis for further experimentation with other published network architectures, as well as others that the readers might want to try. Although the FSRCNN (and other recent network architectures for SR) show clear improvement over the SRCNN, the original SRCNN is also described here to show how this pioneer network has evolved from its inception to newer networks that use different topologies to achieve better results. In the tutorial, we will implement the FSRCNN network using the Intel Distribution for Caffe deep learning framework and Intel Distribution for Python, which will let us take advantage of Intel® Xeon® processors and Intel® Xeon Phi™ processors, as well as Intel® libraries to accelerate training and testing of this network.

Super-Resolution Convolutional Neural Network (SRCNN) Structure

The authors of the SRCNN describe their network, pointing out the equivalence of their method to the sparse-coding method [4], which is a widely used learning method for image SR. This is an important and educational aspect of their work, because it shows how example-based learning methods can be adapted and generalized to CNN models.

The SRCNN consists of the following operations [1]:

  1. Preprocessing: Up-scales LR image to desired HR size.
  2. Feature extraction: Extracts a set of feature maps from the up-scaled LR image.
  3. Non-linear mapping: Maps the feature maps representing LR to HR patches.
  4. Reconstruction: Produces the HR image from HR patches.

Operations 2–4 above can each be cast as a convolutional layer in a CNN that accepts as input the preprocessed images from step 1 above and outputs the HR image. The structure of this SRCNN consists of three convolutional layers:

  • Input Image: LR image up-sampled to desired higher resolution and c channels (the color components of the image)
  • Conv. Layer 1: Patch extraction
    • n1 filters of size c× f1× f1
    • Activation function: ReLU (rectified linear unit)
    • Output: n1 feature maps
    • Parameters to optimize: c× f1× f1× n1 weights and n1 biases
  • Conv. Layer 2: Non-linear mapping
    • n2 filters of size n1× f2× f2
    • Activation function: ReLU
    • Output: n2 feature maps
    • Parameters to optimize: n1× f2× f2× n2 weights and n2 biases
  • Conv. Layer 3: Reconstruction
    • One filter of size n2× f3× f3
    • Activation function: Identity
    • Output: HR image
    • Parameters to optimize: n2× f3× f3× c weights and c biases
  • Loss Function: Mean squared error (MSE) between the N reconstructed HR images and the N original true HR images in the training set (N is the number of images in the training set).
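
Written out (the symbols here are chosen only for illustration: Y_i is the i-th preprocessed LR input, X_i the corresponding ground-truth HR image, F(Y_i; Θ) the network output with parameters Θ, and N the number of training images), the loss minimized during training is:

L(Θ) = (1/N) * Σ_{i=1..N} || F(Y_i; Θ) - X_i ||^2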

In their paper, the authors of the SRCNN implement and test their network using several settings that vary the number of filters. They get better SR performance when they increase the number of filters, at the expense of increasing the number of parameters (weights and biases) to optimize, which in turn increases the computational cost. Next is their reference model, which shows good overall results in terms of accuracy and performance (Figure 1):

  • Input Image: LR single-channel image up-sampled to desired higher resolution
  • Conv. Layer 1: Patch extraction
    • 64 filters of size 1 x 9 x 9
    • Activation function: ReLU
    • Output: 64 feature maps
    • Parameters to optimize: 1 x 9 x 9 x 64 = 5184 weights and 64 biases
  • Conv. Layer 2: Non-linear mapping
    • 32 filters of size 64 x 1 x 1
    • Activation function: ReLU
    • Output: 32 feature maps
    • Parameters to optimize: 64 x 1 x 1 x 32 = 2048 weights and 32 biases
  • Conv. Layer 3: Reconstruction
    • 1 filter of size 32 x 5 x 5
    • Activation function: Identity
    • Output: HR image
    • Parameters to optimize: 32 x 5 x 5 x 1 = 800 weights and 1 bias

Figure 1. Structure of SRCNN showing parameters for reference model.

Fast Super-Resolution Convolutional Neural Network (FSRCNN) Structure

The authors of the SRCNN recently created a new CNN that accelerates the training and prediction tasks while achieving comparable or better performance than the SRCNN. The new FSRCNN consists of the following operations [2]:

  1. Feature extraction: Extracts a set of feature maps directly from the LR image.
  2. Shrinking: Reduces dimension of feature vectors (thus decreasing the number of parameters) by using a smaller number of filters (compared to the number of filters used for feature extraction).
  3. Non-linear mapping: Maps the feature maps representing LR to HR patches. This step is performed using several mapping layers with a filter size smaller than that used in the SRCNN.
  4. Expanding: Increases dimension of feature vectors. This operation performs the inverse operation as the shrinking layers, in order to more accurately produce the HR image.
  5. Deconvolution: Produces the HR image from HR features.

The authors explain in detail the differences between SRCNN and FSRCNN, but things particularly relevant for a quick implementation and experimentation (which is the scope of this article and the associated tutorial) are the following:

  1. FSRCNN uses multiple convolution layers for the non-linear mapping operation (instead of a single layer in SRCNN). The number of layers can be changed (compared to the author’s version) in order to experiment. Performance and accuracy of reconstruction will vary with those changes. Also, this is a good example for fine-tuning a CNN by keeping the portion of FSRCNN fixed up to the non-linear mapping layers, and then adding or changing those layers to experiment with different lengths for the non-linear LR-HR mapping operation.
  2. The input image is directly the LR image. It does not need to be up-sampled to the size of the expected HR image, as in the SRCNN. This is part of why this network is faster; the feature extraction stage uses a smaller number of parameters compared to the SRCNN.

As seen in Figure 2, the five operations shown above can be cast as a CNN using convolutional layers for operations 1–4, and a deconvolution layer for operation 5. Non-linearities are introduced via parametric rectified linear unit (PReLU) layers (described in [5]), which the authors chose for this particular model because of better and more stable performance compared to rectified linear unit (ReLU) layers. See Appendix 1 for a brief description of ReLUs and PReLUs.

Figure 2. Structure of FSRCNN(56, 12, 4).

The overall best performing model reported by the authors is the FSRCNN (56, 12, 4) (Figure 2), which refers to a network with a LR feature dimension of 56 (number of filters both in the first convolution and in the deconvolution layer), 12 shrinking filters (the number of filters in the layers in the middle of the network, performing the mapping operation), and a mapping depth of 4 (the number of convolutional layers that implement the mapping between the LR and the HR feature space). This is the reason why this network looks like an hourglass; it is thick (more parameters) at the edges and thin (fewer parameters) in the middle. The overall shape of this reference model is symmetrical and its structure is as follows:

  • Input Image: LR single channel.
  • Conv. Layer 1: Feature extraction
    • 56 filters of size 1 x 5 x 5
    • Activation function: PReLU
    • Output: 56 feature maps
    • Parameters: 1 x 5 x 5 x 56 = 1400 weights and 56 biases
  • Conv. Layer 2: Shrinking
    • 12 filters of size 56 x 1 x 1
    • Activation function: PReLU
    • Output: 12 feature maps
    • Parameters: 56 x 1 x 1 x 12 = 672 weights and 12 biases
  • Conv. Layers 3–6: Mapping
    • 4 x 12 filters of size 12 x 3 x 3
    • Activation function: PReLU
    • Output: HR feature maps
    • Parameters: 4 x 12 x 3 x 3 x 12 = 5184 weights and 48 biases
  • Conv. Layer 7: Expanding
    • 56 filters of size 12 x 1 x 1
    • Activation function: PReLU
    • Output: 56 feature maps
    • Parameters: 12 x 1 x 1 x 56 = 672 weights and 56 biases
  • DeConv Layer 8: Deconvolution
    • One filter of size 56 x 9 x 9
    • Activation function: PReLU
    • Output: HR image
    • Parameters: 56 x 9 x 9 x 1 = 4536 weights and 1 bias

Total number of weights: 12464 (plus a very small number of parameters in PReLU layers)
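
To double-check the arithmetic above, the following minimal Python sketch (not part of the original papers; it simply recomputes the counts listed in this section) tallies the weights and biases of FSRCNN(56, 12, 4):

# Each entry is (layer name, number of filters, input channels, kernel height, kernel width).
layers = [
    ("feature extraction", 56,  1, 5, 5),
    ("shrinking",          12, 56, 1, 1),
]
layers += [("mapping %d" % (i + 1), 12, 12, 3, 3) for i in range(4)]
layers += [
    ("expanding",     56, 12, 1, 1),
    ("deconvolution",  1, 56, 9, 9),
]

total_weights = 0
for name, filters, channels, kh, kw in layers:
    weights = filters * channels * kh * kw   # one kernel per filter, spanning all input channels
    biases = filters                         # one bias per filter
    total_weights += weights
    print("%-18s %5d weights, %3d biases" % (name, weights, biases))

print("Total weights: %d" % total_weights)   # prints 12464

The same bookkeeping applied to the SRCNN reference model above reproduces its 5184 + 2048 + 800 weights.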

Figure 3 shows an example of using the trained FSRCNN on one of the test images. The protobuf file describing this network, as well as training and testing data preparation and implementation details, will be covered in the associated tutorial.

Figure 3. An example of inference using a trained FSRCNN. The left image is the original. In the center, the original image was down-sampled and blurred. The image on the right is the reconstructed HR image using this network.

Summary

This article presented an overview of two recent CNNs for single-image super-resolution. The networks chosen are representative of the state of the art in SR and, as some of the first published CNN-based methods, offer interesting insights into how a non-CNN method (sparse coding) inspired a CNN-based approach. In the associated tutorial, an implementation of FSRCNN is shown using the Intel® Distribution for Caffe* framework and Intel® Distribution for Python*. This reference implementation can be used to experiment with variations of this network, is a good example for fine-tuning a network, and can serve as a base for implementing the newer super-resolution networks that have been published recently. These newer networks show improvements in reconstruction quality or training/inference speed, and some of them attempt to solve the multi-frame SR problem. The reader is encouraged to experiment with these new networks.

Appendix 1: Rectified Linear Units (Rectifiers)

Rectified activation units (rectifiers) in neural networks are one way to introduce non-linearities in the network. A non-linear layer (also called an activation layer) is necessary in an NN to prevent it from becoming a pure linear model with limited learning capabilities. Other possible activation layers are, among others, a sigmoid function or a hyperbolic tangent (tanh) layer. However, rectifiers have better computational efficiency, improving the overall training of the CNN.

The most commonly used rectifier is the traditional rectified linear unit (ReLU), which performs an operation defined mathematically as:

f(x_i) = max(0, x_i)

where x_i is the input on the i-th channel.

Another rectifier introduced recently [5] is the parametric rectified linear unit (PReLU), defined as:

f(x_i) = max(0, x_i) + p_i min(0, x_i)

which includes parameters p_i controlling the slope of the line representing the negative inputs. These parameters are learned jointly with the model during the training phase. To reduce the number of parameters, the p_i parameters can be collapsed into one learnable parameter shared by all channels.

A particular case of the PReLU is the leaky ReLU (LReLU), which is a PReLU with p_i defined as a small constant k for all input channels.
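
To make these definitions concrete, here is a minimal NumPy sketch of the three rectifiers (for illustration only; in Caffe they are provided as ready-made layers, as shown next):

import numpy as np

def relu(x):
    # ReLU: passes positive inputs unchanged and zeroes out negative inputs
    return np.maximum(0.0, x)

def prelu(x, p):
    # PReLU: positive inputs unchanged; negative inputs scaled by the slope p
    # (a single shared scalar here; in general one slope per channel)
    return np.where(x > 0, x, p * x)

def lrelu(x, k=0.1):
    # Leaky ReLU: a PReLU whose slope is a small fixed constant k
    return prelu(x, k)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))         # 0, 0, 0, 1.5
print(prelu(x, 0.25))  # -0.5, -0.125, 0, 1.5
print(lrelu(x))        # -0.2, -0.05, 0, 1.5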

In Caffe, a PReLU layer can be defined (in a protobuf file) as

layer {
 name: "reluXX"
 type: "PReLU"
 bottom: "convXX"
 top: "convXX"
 prelu_param {
  channel_shared: 1
 }
}

Where, in this case, the negative slopes are shared across channels. A different option is to initialize the slope to a fixed small value (0.1 here) using a constant filler; note that in Caffe the slope remains learnable unless the layer’s learning rate multiplier is set to zero:

layer {
 name: "reluXX"
 type: "PReLU"
 bottom: "convXX"
 top: "convXX"
 prelu_param {
  filler {
   type: "constant"
   value: 0.1
  }
 }
}

References

1. C. Dong, C. C. Loy, K. He and X. Tang, "Learning a Deep Convolutional Network for Image Super-Resolution," 2014.

2. C. Dong, C. C. Loy and X. Tang, "Accelerating the Super-Resolution Convolutional Neural Network," 2016.

3. P. B. Chopade and P. M. Patil, "Single and Multi Frame Image Super-Resolution and its Performance Analysis: A Comprehensive Survey," February 2015.

4. J. Yang, J. Wright, T. Huang and Y. Ma, "Image Super-Resolution via Sparse Representation," IEEE Transactions on Image Processing, pp. 2861-2873, 2010.

5. K. He, X. Zhang, S. Ren and J. Sun, "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification," arXiv.org, 2015.

6. A. Greaves and H. Winter, "Multi-Frame Video Super-Resolution Using Convolutional Neural Networks," 2016.

7. J. Kim, J. K. Lee and K. M. Lee, "Accurate Image Super-Resolution Using Very Deep Convolutional Networks," 2016.

Getting Started with Intel® SDK for OpenCL™ Applications (Linux SRB4.1)


This article is a step-by-step guide to quickly get started developing applications using Intel® SDK for OpenCL™ Applications in Linux for SRB4.1. This is now a legacy release. For instructions to install the latest release, please see https://software.intel.com/articles/sdk-for-opencl-gsg.

  1. Install the driver
  2. Install the SDK
  3. Set up Eclipse

 

Step 1: Install the driver

To run applications using OpenCL kernels on the Intel Processor Graphics GPU device with the latest features for the newest processors, you will need a driver package from here: https://software.intel.com/en-us/articles/opencl-drivers.

(If your target processor does not include Intel Processor Graphics, install the latest runtime package instead.)

The install_OCL_driver_ubuntu.sh script covers the steps needed to install the SRB4 driver package on Ubuntu 14.04.

To use it:

$ tar -xvf install_OCL_driver_ubuntu.tgz
$ sudo su
$ ./install_OCL_driver_ubuntu.sh

 

This script automates downloading prerequisites, installing the user-mode components, patching the 4.7 kernel, and building it. 

 

You can check your progress with the System Analyzer Utility. If successful, you should see smoke test results looking like this at the bottom of the system analyzer output:

--------------------------
Component Smoke Tests:
--------------------------
 [ OK ] OpenCL check:platform:Intel(R) OpenCL GPU OK CPU OK

 

 

Step 2: Install the SDK

This script will set up all prerequisites for a successful SDK install on Ubuntu. After this, run the SDK installer.

Here is a kernel to test the SDK install:

__kernel void simpleAdd(
                       __global int *pA,
                       __global int *pB,
                       __global int *pC)
{
    const int id = get_global_id(0);
    pC[id] = pA[id] + pB[id];
}                               

Check that the command line compiler ioc64 is installed with

$ ioc64 -input=simpleAdd.cl -asm

(expected output)
No command specified, using 'build' as default
OpenCL Intel(R) Graphics device was found!
Device name: Intel(R) HD Graphics
Device version: OpenCL 2.0
Device vendor: Intel(R) Corporation
Device profile: FULL_PROFILE
fcl build 1 succeeded.
bcl build succeeded.

simpleAdd info:
	Maximum work-group size: 256
	Compiler work-group size: (0, 0, 0)
	Local memory size: 0
	Preferred multiple of work-group size: 32
	Minimum amount of private memory: 0

Build succeeded!

 

Step 3: Set up Eclipse

Intel SDK for OpenCL applications works with Eclipse Mars and Neon.

After installing, copy the CodeBuilder*.jar file from the SDK eclipse-plug-in folder to the Eclipse dropins folder.

$ cd eclipse/dropins
$ find /opt/intel -name 'CodeBuilder*.jar' -exec cp {} . \;

Start Eclipse.  Code-Builder options should be available in the main menu.

GPU Debugging: Challenges and Opportunities


From GPU Debugging: Challenges and Opportunities presented at the International Workshop on OpenCL (IWOCL) 2017.

GPU debugging support matches the OpenCL™ 2.0 GPU/CPU driver package for Linux* (64-bit) from OpenCL™ Drivers and Runtimes for Intel® Architecture with the notable exception of processors based on the Broadwell architecture. 

Basic concepts to keep in mind for GPU debugging

  • There are "host" and "target" components.  Host = where you interact with the debugger, target = where the application is run 
  • There are 3 components: gdbserver, the application to be debugged, and the gdb session.
  • The gdb session and application can be on the same or different machines
  • Breakpoints in the graphics driver can affect screen rendering.  You cannot debug on the same system that is rendering your screen.
  • However, non-graphical connections (such as with SSH) are unaffected. The "host" can be connected to remotely as well for gdb's text interface.

Abbreviations used:

KMD – Kernel Mode Driver

RT – OpenCL Runtime

DCD – Debug Companion Driver

  • Ring-0 driver, provides low-level gfx access
  • Run control flow, breakpoints, etc

DSL – Debug Support Library

  • Ring-3 debugger driver (shared library)
  • Loaded into the gdbserver process

DSL <--> DCD

  • Communicate via IOCTLs

How to set up a debugging session

The simplest option is to use ssh for steps 1, 2, and 3. However, gdb can be run locally as well. The target steps (1 and 2) should be run remotely because GPU breakpoints can cause rendering hangs.

1. launch gdbserver

/usr/bin/gdbserver-igfx :1234 --attach 123

2. launch the application

export IGFXDBG_OVERRIDE_CLIENT_PID=123
./gemm

Note: there is an automatic breakpoint at the first kernel launch

3. launch GDB

source /opt/intel/opencl-sdk/gt_debugger_2016.0/bin/debuggervars.sh

/opt/intel/opencl-sdk/gt_debugger_2016.0/bin/launch_gdb.sh --tui

In GDB

target remote :1234
continue
x/i $pc

GDB should now be able to step through the kernel code.

OpenCL™ Drivers and Runtimes for Intel® Architecture


What to Download

By downloading a package from this page, you accept the End User License Agreement.

Installation has two parts:

  1. Intel® SDK for OpenCL™ Applications Package
  2. Driver and library(runtime) packages

The SDK includes components to develop applications: IDE integration, offline compiler, debugger, and other tools.  Usually on a development machine the driver/runtime package is also installed for testing.  For deployment you can pick the package that best matches the target environment.

The illustration below shows some example install configurations. 

 

SDK Packages

Please note: A GPU/CPU driver package or CPU-only runtime package is required in addition to the SDK to execute applications

Standalone:

Suite: (also includes driver and Intel® Media SDK)

 

Driver/Runtime Packages Available

GPU/CPU Driver Packages

CPU-only Runtime Packages  

 


Intel® SDK for OpenCL™ Applications 2016 R3 for Linux (64-bit)

This is a standalone release for customers who do not need integration with the Intel® Media Server Studio. It provides components to develop OpenCL applications for Intel processors. 

Visit https://software.intel.com/en-us/intel-opencl to download the version for your platform. For details check out the Release Notes.

Intel® SDK for OpenCL™ Applications 2016 R3 for Windows* (64-bit)

This is a standalone release for customers who do not need integration with the Intel® Media Server Studio. The standard Windows graphics driver packages contains the driver and runtime library components necessary to run OpenCL applications. This package provides components for OpenCL development. 

Visit https://software.intel.com/en-us/intel-opencl to download the version for your platform. For details check out the Release Notes.


OpenCL™ 2.0 GPU/CPU driver package for Linux* (64-bit)

 

The intel-opencl-r5.0 (SRB5.0) Linux driver package enables OpenCL 1.2 or 2.0 on the GPU/CPU for the following Intel® processors:

  • Intel® 5th, 6th or 7th generation Core™ processor
  • Intel® Celeron® Processor J3000 Series with Intel® HD Graphics 500 (J3455, J3355), Intel® Pentium® Processor J4000 Series with Intel® HD Graphics 505 (J4205), Intel® Celeron® Processor N3000 Series with Intel® HD Graphics 500 (N3350, N3450), Intel® Pentium Processor N4000 Series with Intel® HD Graphics 505 (N4200)
  • Intel® Xeon® v4, or Intel® Xeon® v5 Processors with Intel® Graphics Technology (if enabled by OEM in BIOS and motherboard)

Installation Instructions.  Scripts to automate install and additional install documentation available here.

Intel validates the intel-opencl-r5.0 driver on CentOS 7.2 and 7.3 when running the following 64-bit kernels:

  • Linux 4.7 kernel patched for OpenCL
  • Linux 4.4 kernel patched for  Intel® Media Server Studio 2017 R3

Although Intel validates and provides technical support only for the above Linux kernels on CentOS 7.2 and 7.3, other distributions may be adapted by utilizing our generic operating system installation steps as well as MSS 2017 R3 installation steps.  

In addition: Intel also validates Ubuntu 16.04.2 when running the following 64-bit kernel:

  • Ubuntu 16.04.2 default 4.8 kernel

Ubuntu 16.04 with the default kernel works fairly well, but some core features (e.g., device enqueue, SVM memory coherency, VTune support) won’t work without kernel patches. This configuration has been minimally validated to prove that it is viable to suggest for experimental use, but it is not fully supported or certified.

Supported OpenCL devices:

  • Intel® graphics (GPU)
  • CPU

For detailed information please see the driver package Release Notes. 

Previous Linux driver packages:

Intel intel-opencl-r4.1 (SRB4.1) Linux driver package: Installation instructions, Release Notes
Intel intel-opencl-r4.0 (SRB4) Linux driver package: Installation instructions, Release Notes
SRB3.1 Linux driver package: Installation instructions, Release Notes

For Linux drivers covering earlier platforms such as 4th generation Intel Core processor please see the versions of Media Server Studio in the Driver Support Matrix.


OpenCL™ Driver for Iris™ graphics and Intel® HD Graphics for Windows* OS (64-bit and 32-bit)

The standard Intel graphics drivers for Windows* include components needed to run OpenCL* and Intel® Media SDK applications on processors with Intel® Iris™ Graphics or Intel® HD Graphics on Windows* OS.

You can use the Intel Driver Update Utility to automatically detect and update your drivers and software.  Using the latest available graphics driver for your processor is usually recommended.

 

Supported OpenCL devices:

  • Intel graphics (GPU)
  • CPU

For the full list of Intel® Architecture processors with OpenCL support on Intel Graphics under Windows*, refer to the Release Notes.

 


OpenCL™ Runtime for Intel® Core™ and Intel® Xeon® Processors

This runtime software package adds OpenCL CPU device support on systems with Intel Core and Intel Xeon processors.

Supported OpenCL devices:

  • CPU

Latest release (16.1.1)

Previous Runtimes (16.1)

Previous Runtimes (15.1):

For the full list of supported Intel® architecture processors, refer to the OpenCL™ Runtime Release Notes.

 


 Deprecated Releases

Note: These releases are no longer maintained or supported by Intel

OpenCL™ Runtime 14.2 for Intel® CPU and Intel® Xeon Phi™ Coprocessors

This runtime software package adds OpenCL support to Intel Core and Xeon processors and Intel Xeon Phi coprocessors.

Supported OpenCL devices:

  • Intel Xeon Phi coprocessor
  • CPU

Available Runtimes

For the full list of supported Intel architecture processors, refer to the OpenCL™ Runtime Release Notes.

An Example of a Convolutional Neural Network for Image Super-Resolution—Tutorial


This tutorial describes one way to implement a CNN (convolutional neural network) for single-image super-resolution using the Caffe* deep learning framework optimized for Intel® architecture and the Intel® Distribution for Python*, which lets us take advantage of Intel processors and Intel libraries to accelerate training and testing of this CNN.

The CNN we use in this tutorial is the Fast Super-Resolution Convolutional Neural Network (FSRCNN), based on the work described in [1] and [2], whose authors proposed a new approach to performing single-image SR using CNNs. We describe this network and its predecessor (the Super-Resolution Convolutional Neural Network, SRCNN) in more detail in an associated article (“An Example of a Convolutional Neural Network for Image Super-Resolution”).

FSRCNN Structure

As described in the associated article and in [2], the FSRCNN consists of the following operations:

  1. Feature extraction: Extracts a set of feature maps directly from the low-resolution (LR) image.
  2. Shrinking: Reduces dimension of feature vectors (thus decreasing the number of parameters) by using a smaller number of filters (compared to the number of filters used for feature extraction).
  3. Non-linear mapping: Maps feature maps representing LR patches to high-resolution (HR) ones. This step is performed using several mapping layers with a filter size smaller than the one used in SRCNN.
  4. Expanding: Increases dimension of feature vectors. This operation performs the inverse operation as the shrinking layers in order to more accurately produce the HR image.
  5. Deconvolution: Produces the HR image from HR features.

The structure of the FSRCNN (56, 12, 4) model (which is the best performing model reported in [2], and described in the associated article) is shown in Figure 1. It has a LR feature dimension of 56 (number of filters both in the first convolution and in the deconvolution layer), 12 shrinking filters (the number of filters in the layers in the middle of the network, performing the mapping operation), and a mapping depth of 4 (the number of convolutional layers that implement the mapping between the LR and the HR feature space).

Graphic showing structure of FSRCNN
Figure 1: Structure of the FSRCNN (56, 12, 4).

Training and Testing Data Preparation

Datasets to train and test this implementation are available from the authors’ [2]  website. The train dataset consists of 91 images of different sizes. There are two test datasets: Set 5 (containing 5 images) and Set 14 (containing 14 images). In this tutorial, both train and test datasets will be packed into an HDF5* file (https://support.hdfgroup.org/), which can be efficiently used from the Caffe framework. For more information about Caffe optimized for Intel® architecture, visit Manage Deep Learning Networks with Caffe* Optimized for Intel® Architecture and Recipe: Optimized Caffe* for Deep Learning on Intel® Xeon Phi™ Processor x200.

Both train and test datasets need some preprocessing, as follows:

  • Train dataset: First, the images are converted to YCrCb color space (https://en.wikipedia.org/wiki/YCbCr), and only the luminance channel Y is used in this tutorial. Each of the 91 images in the train dataset is downsampled by a factor k, where k is the scaling factor desired for super-resolution, obtaining in this way a pair of corresponding LR and HR images. Next, each image pair (LR/HR) is cropped into a subset of small subimages, using stride s, so we end up with N pairs of LR/HR subimages for each one of the 91 original train images. The reason for cropping the images for training is that we want to train the model using both LR and HR local features located in a small area. The number of subimages, N, depends on the size of the subimages and the stride s. The authors of [2], for their experiments define a 7x7 pixels size for the LR subimages, and a 21x21 pixels size for the HR subimages, which corresponds to a scaling factor k=3.
  • Test dataset: Each image in the test dataset is processed in the same way as the training dataset, with the exception that the stride s can be larger than the one used for training, to accelerate the testing procedure.

The following Python code snippets show one possible way to generate the train and test datasets. We use OpenCV* (http://opencv.org/) to handle and preprocess the images. The first snippet shows how to generate the HR and LR subimage pair set from one of the original images in the 91-image train dataset for the specific case where scaling factor k=3 and stride = 19:

import os
import sys
import numpy as np
import h5py

sys.path.append('$CAFFE_HOME/opencv-2.4.13/release/lib/')
import cv2

# Parameters
scale = 3
stride = 19
size_ground = 19
size_input = 7
size_pad = 2

#Read image to process
image = cv2.imread('<PATH TO FILES>/Train/t1.bmp')

#Change color-space to YCR_CB
image_ycrcb = cv2.cvtColor(image, cv2.COLOR_RGB2YCR_CB)
image_ycrcb = image_ycrcb[:,:,0]
image_ycrcb = image_ycrcb.reshape((image_ycrcb.shape[0], image_ycrcb.shape[1], 1))

#Compute size of LR images and resize HR images to a multiple of scale
height, width = image_ycrcb.shape[:2]
height_small = int(height/scale)
width_small  = int(width/scale)

image_pair_HR = cv2.resize(image_ycrcb, (width_small*scale, height_small*scale) )
image_pair_LR = cv2.resize(image_ycrcb, (width_small, height_small) )

# Declare tensors to hold 1024 LR-HR subimage pairs
input_HR = np.zeros((size_ground, size_ground, 1, 1024))
input_LR = np.zeros((size_input + 2*size_pad, size_input + 2*size_pad, 1, 1024))

height, width = image_pair_HR.shape[:2]

#Iterate over the train image using the specified stride and create LR-HR subimage pairs
count = 0
for i in range(0, height-size_ground+1, stride):
    for j in range(0, width-size_ground+1, stride):
       subimage_HR = image_pair_HR[i:i+size_ground, j:j+size_ground]
       count = count + 1
       height_small = size_input
       width_small  = size_input
       subimage_LR = cv2.resize(subimage_HR, (width_small, height_small) )

       # Zero-pad the LR subimage and store the LR-HR subimage pair
       input_HR[:,:,0,count-1] = subimage_HR
       input_LR[:,:,0,count-1] = np.lib.pad(subimage_LR, ((size_pad, size_pad), (size_pad, size_pad)), 'constant', constant_values=(0.0))

The next snippet shows how to use the python h5py module to create an hdf5 file that contains the HR and LR subimage pair set created in the previous snippet:

(…)
#Create an hdf5 file
with h5py.File('train1.h5','w') as H5:
    H5.create_dataset( 'Input', data=input_LR )
    H5.create_dataset( 'Ground', data=input_HR )
(…)

The previous two snippets can be used to create the hdf5 file containing the entire training set of 91 images to be used for training in Caffe.
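
Caffe’s HDF5Data layer reads its input from a plain-text source file that lists one HDF5 file path per line (the train.txt/test.txt files referenced in the next section). As a minimal sketch, assuming one .h5 file was generated per training image with the naming used above, such a source file could be written like this:

# Write the source file for Caffe's HDF5Data layer: one .h5 path per line.
# File names are examples; adjust them to the files actually generated above.
h5_files = ['examples/FSRCNN/train%d.h5' % i for i in range(1, 92)]
with open('examples/FSRCNN/train.txt', 'w') as f:
    for path in h5_files:
        f.write(path + '\n')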

FSRCNN Training

The reference model (described in the previous section) is implemented using Intel® Distribution for Caffe, which has been optimized to run on Intel CPUs. An introduction to the basics of this framework and directions to install it can be found at the Intel® Nervana AI Academy.

In Caffe, models are defined using protobuf files. The FSRCNN model can be downloaded from the authors’ [2] website. The code snippet below shows the input layer and the first convolutional layer of the FSRCNN (56, 12, 4) model defined by its authors [2]. The input layer reads the train/test data from the files listed in the source files train.txt and test.txt located in the $CAFFE_ROOT/examples/FSRCNN directory. The batch size for training is 128.

name: "SR_test"
layer {
  name: "data"
  type: "HDF5Data"
  top: "data"
  top: "label"
  hdf5_data_param {
    source: "examples/FSRCNN/train.txt"
    batch_size: 128
  }
  include: { phase: TRAIN }
}
layer {
  name: "data"
  type: "HDF5Data"
  top: "data"
  top: "label"
  hdf5_data_param {
    source: "examples/FSRCNN/test.txt"
    batch_size: 2
  }
  include: { phase: TEST }
}

layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 0.1
  }
  convolution_param {
    num_output: 56
    kernel_size: 5
    stride: 1
    pad: 0
    weight_filler {
      type: "gaussian"
      std: 0.0378
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
(...)

To train the above model, the authors of [2] provide in their website a solver protobuf file containing the training parameters and the location of the protobuf network definition file:

# The train/test net protocol buffer definition
net: "examples/FSRCNN/FSRCNN.prototxt"
test_iter: 100
# Carry out testing every 500 training iterations.
test_interval: 5000
# The base learning rate, momentum and the weight decay of the network.
#base_lr: 0.005
base_lr: 0.001
momentum: 0.9
weight_decay: 0
# Learning rate policy
lr_policy: "fixed"
# Display results every 100 iterations
display: 1000
# Maximum number of iterations
max_iter: 1000000
# write intermediate results (snapshots)
snapshot: 5000
snapshot_prefix: "examples/FSRCNN/RESULTS/FSRCNN-56_12_4"
# solver mode: CPU or GPU
solver_mode: CPU

The solver shown above will train the network defined in the model definition file FSRCNN.prototxt using the following parameters:

  • The test interval will be every 5000 iterations, and 100 is the number of forward passes the test should perform.
  • The base learning rate will be 0.001 (the authors’ earlier value of 0.005 is left commented out in the solver file), and the learning rate policy is fixed, which means the learning rate will not change with time. Momentum is 0.9 (a common choice) and weight_decay is zero (no regularization to penalize large weights).
  • Intermediate results (snapshots) will be written to disk every 5000 iterations, and the maximum number of iterations (when the training will stop) is 1000000.
  • Snapshot results will be written to the examples/FSRCNN/RESULTS directory (assuming we run Caffe from the install directory $CAFFE_ROOT). Model files (containing the trained weights) will be prefixed by the string ‘FSRCNN-56_12_4’.

The reader is encouraged to experiment with different parameters. One useful option is to define a small maximum number of iterations, explore how the test error decreases, and compare this rate between different sets of parameters (a simple way to extract the test loss from the training log is sketched after the training command below).

Once the network definition and solver files are ready, start training by running the caffe command located in the build/tools directory:

export CAFFE_ROOT=< Path to caffe >
$CAFFE_ROOT/build/tools/caffe train -engine "MKL2017" -solver \
$CAFFE_ROOT/examples/FSRCNN/FSRCNN_solver.prototxt \
2>$CAFFE_ROOT/examples/FSRCNN/output.log
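
Once training is running, one way to follow how the test error decreases (as suggested above) is to pull the test loss values out of output.log. The short sketch below is not part of the original tutorial; it assumes the default Caffe log format (an "Iteration N, Testing net" line followed by "Test net output #0: <name> = <value>" lines), and the output blob name depends on the loss layer defined in FSRCNN.prototxt:

import re

iteration = None
with open('examples/FSRCNN/output.log') as log:
    for line in log:
        m = re.search(r'Iteration (\d+), Testing net', line)
        if m:
            iteration = int(m.group(1))
        m = re.search(r'Test net output #0: \S+ = ([0-9.eE+\-]+)', line)
        if m and iteration is not None:
            # Print "iteration test_loss" pairs, e.g. for plotting
            print(iteration, float(m.group(1)))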

Resume Training Using Saved Snapshots

After training the CNN, the network parameters (weights) will be written to disk according to the frequency specified by the snapshot parameter. Caffe will create two files at each snapshot:

FSRCNN-56_12_4_iter_1000000.caffemodel
FSRCNN-56_12_4_iter_1000000.solverstate

The model file contains the learned model parameters corresponding to the indicated iteration, serialized as binary protocol buffer files. The solver state file is the state snapshot containing all the necessary information to recover the solver state at the time of the snapshot. This file will let us resume training from the snapshot instead of restarting from scratch. For example, let us assume we ran training for 1 million iterations, and after that we realize that we need to run it for an extra 500K iterations to further reduce the testing error. We can restart the training using the snapshot taken after 1 million iterations:

$CAFFE_ROOT/build/tools/caffe train -engine "MKL2017" -solver \
$CAFFE_ROOT/examples/FSRCNN/FSRCNN_solver.prototxt -snapshot \
$CAFFE_ROOT/examples/FSRCNN/RESULTS/FSRCNN-56_12_4_iter_1000000.solverstate \
2>$CAFFE_ROOT/examples/FSRCNN/output_resume.log

The resumed training will run until the new maximum number of iterations specified in the solver file is reached, which in this case is 1500000.

FSRCNN Testing Using Pre-Trained Parameters

Once we have a trained model, we can use it to perform super-resolution on an input LR image. We can test the network at any moment during the training as long as we have model snapshots already generated.

In practice, we can use the super-resolution model we trained to increase the resolution of any image or video. However, for the purposes of this tutorial, we want to test our trained model on an LR image for which we have an HR image to compare with. To this end, we will use a sample image from the test dataset used in [1] and [2] (from the Set5 dataset, which is also commonly used to test SR models in other publications).

To perform the test, we will use a sample image (butterfly) as the ground truth. To create the input LR image, we will blur and downsample the ground truth image, and will use it to feed the trained network. Once we forward-run the network with the input image, obtaining a super-resolved image as output, we will compare the three images (ground truth, LR, and super-resolved) to visually evaluate the performance of the SR network we trained.

The test procedure described above can be implemented in several ways. As an example, the following Python script implements the testing procedure using the OpenCV library for image handling:

    import os
    import sys
    import numpy as np

    #Set up caffe root directory and add to path
    caffe_root = '$APPS/caffe/'
    sys.path.insert(0, caffe_root + 'python')
    sys.path.append('opencv-2.4.13/release/lib/')

    import cv2
    import caffe

    # Parameters
    scale = 3

    #Create Caffe model using pretrained model
    net = caffe.Net(caffe_root + 'FSRCNN_predict.prototxt',
                      caffe_root + 'examples/FSRCNN/RESULTS/FSRCNN-56_12_4_iter_300000.caffemodel', caffe.TRAIN)

    #Input directories
    input_dir = caffe_root + 'examples/SRCNN/DATA/Set5/'

    #Input ground truth image
    im_raw = cv2.imread(caffe_root + '/examples/SRCNN/DATA/Set5/butterfly.bmp')

    #Change format to YCR_CB
    ycrcb = cv2.cvtColor(im_raw, cv2.COLOR_RGB2YCR_CB)
    im_raw = ycrcb[:,:,0]
    im_raw = im_raw.reshape((im_raw.shape[0], im_raw.shape[1], 1))

    #Blur image and resize to create input for network
    im_blur = cv2.blur(im_raw, (4,4))
    im_small = cv2.resize(im_blur, (int(im_raw.shape[1]/scale), int(im_raw.shape[0]/scale)))  # cv2.resize expects (width, height)

    im_raw = im_raw.reshape((1, 1, im_raw.shape[0], im_raw.shape[1]))
    im_blur = im_blur.reshape((1, 1, im_blur.shape[0], im_blur.shape[1]))
    im_small = im_small.reshape((1, 1, im_small.shape[0], im_small.shape[1]))

    im_comp = im_blur
    im_input = im_small

    #Set mode to run on CPU
    caffe.set_mode_cpu()

    #Copy input image data to net structure
    c1,c2,h,w = im_input.shape
    net.blobs['data'].data[...] = im_input

    #Run forward pass
    out = net.forward()

    #Extract output image from net, change format to int8 and reshape
    mat = out['conv3'][0]
    mat = (mat[0,:,:]).astype('uint8')

    im_raw = im_raw.reshape((im_raw.shape[2], im_raw.shape[3]))
    im_blur = im_blur.reshape((im_blur.shape[2], im_blur.shape[3]))
    im_comp = im_blur.reshape((im_comp.shape[2], im_comp.shape[3]))

    #Display original (ground truth), blurred and restored images
    cv2.imshow("image",im_raw)
    cv2.imshow("image2",im_comp)
    cv2.imshow("image3",mat)
    cv2.waitKey()

    cv2.destroyAllWindows()

Running the above script on the test image displays the output shown in Figure 2. Readers are encouraged to try this network and refine the parameters to obtain better super-resolution results.

 Grayscale samples comparison of butterfly wing after FSRCNN
Figure 2: Testing the trained FSRCNN. The left image is the ground truth. The image in the center is the ground truth after being blurred and downsampled. The image on the right is the super-resolved image using a model snapshot after 300000 iterations.

Summary

In this short tutorial, we have shown how to train and test a CNN for super-resolution. The CNN we described is the Fast Super-Resolution Convolutional Neural Network (FSRCNN) [2], which is described in more detail in an associated article (“An Example of a Convolutional Neural Network for Image Super-Resolution”). This particular CNN was chosen for this tutorial because of its relative simplicity, good performance, and the importance of the authors’ work in the area of CNNs for super-resolution. Several new CNN architectures for super-resolution have been described in the literature recently, and several of them compare their performance to the FSRCNN or its predecessor, created by the same authors: the SRCNN [1].

The training and testing in this tutorial was performed using Intel® Xeon® processors and Intel® Xeon Phi™ processors, using the Intel Distribution for Caffe deep learning framework and Intel Distribution for Python, which are optimized to run on Intel Xeon processors and Intel Xeon Phi processors.

Deep learning-based image/video super-resolution is an exciting development in the field of computer vision. Readers are encouraged to experiment with this network, as well as newer architectures, and test with their own images and videos. To start using Intel’s optimized tools for machine learning and deep learning, visit Intel® Developer Zone (Intel® DZ).

Bibliography

[1] C. Dong, C. C. Loy, K. He and X. Tang, "Learning a Deep Convolutional Network for Image Super-Resolution," 2014.

[2] C. Dong, C. C. Loy and X. Tang, "Accelerating the Super-Resolution Convolutional Neural Network," 2016.

Setting Up Intel® Ethernet Flow Director


Introduction

Intel® Ethernet Flow Director (Intel® Ethernet FD) directs Ethernet packets to the core where the packet consuming process, application, container, or microservice is running. It is a step beyond receive side scaling (RSS) in which packets are sent to different cores for interrupt processing, and then subsequently forwarded to cores on which the consuming process is running.

Intel Ethernet FD supports advanced filters that direct received packets to different queues, and enables tight control on flow in the platform. It matches flows and CPU cores where the processing application is running for flow affinity, and supports multiple parameters for flexible flow classification and load balancing. When operating in Application Targeting Routing (ATR) mode, Intel Ethernet FD is essentially the hardware offloaded version of Receive Flow Steering available on Linux* systems, and when running in this mode, Receive Packet Steering and Receive Flow Steering are disabled.

It provides the most benefit on Linux bare-metal usages (that is, not using virtual machines (VMs)) where packets are small and traffic is heavy. And because the packet processing is offloaded to the network interface card (NIC), Intel Ethernet FD could be used to avert denial-of-service attacks.

Supported Devices

Intel Ethernet FD is supported on devices that use the ixgbe driver, including the following:

  • Intel® Ethernet Converged Network Adapter X520
  • Intel® Ethernet Converged Network Adapter X540
  • Intel® Ethernet Controller 10 Gigabit 82599 family

It is also supported on devices that use the i40e driver:

  • Intel® Ethernet Controller X710 family
  • Intel® Ethernet Controller XL710 family

DPDK includes support for Intel Ethernet FD on the devices listed above. See the DPDK documentation for how to use DPDK and testpmd with Intel Ethernet FD.

In order to determine whether your device supports Intel Ethernet FD, use the ethtool command with the --show-features or -k parameter on the network interface you want to use:

# ethtool --show-features <interface name> | grep ntuple

Screenshot of using the ethtool command to detect Intel Ethernet Flow Director support.

If the ntuple-filters feature is followed by off or on, Intel Ethernet FD is supported on your Ethernet adapter. However, if the ntuple-filters feature is followed by off [fixed], Intel Ethernet FD is not supported on your network interface.

Enabling Intel® Ethernet Flow Director

Driver Parameters for Devices Supported by the ixgbe Driver

On devices that are supported by the ixgbe driver, there are two parameters that can be passed-in when the driver is loaded into the kernel that will affect Intel Ethernet FD:

  • FdirPballoc
  • AtrSampleRate 

FdirPballoc

This driver parameter specifies the packet buffer size allocated to Intel Ethernet FD. The valid range is 1–3, where 1 specifies that 64k should be allocated for the packet buffer, 2 specifies a 128k packet buffer, and 3 specifies a 256k packet buffer. If this parameter is not explicitly passed to the driver when it is loaded into the kernel, the default value is 1 for a 64k packet buffer.

AtrSampleRate

The AtrSampleRate parameter indicates how many Tx packets will be skipped before a sample is taken. The valid range is from 0 to 255. If the parameter is not passed to the driver when it is loaded into the kernel, the default value is 20, meaning that every 20th packet will be sampled to determine if a new flow should be created. Passing a value of 0 will disable ATR mode, and no samples will be taken from the Tx queues.

The above driver parameters are not supported on devices that use the i40e driver.

To enable these parameters, first unload the ixgbe module from the kernel. Note, if you are connecting to the system over ssh, this may disconnect your session:

# rmmod ixgbe

Then re-load the ixgbe driver into the kernel with the desired parameters listed above:

# modprobe ixgbe FdirPballoc=3,2,2,3 AtrSampleRate=31,63,127,255

Note that, in this example, for each parameter there are four values. This is because on my test system, I have two network adapters that are using the ixgbe driver--an Intel Ethernet Controller 10 Gigabit 82599, and an Intel® Ethernet Controller 10 Gigabit X540--each of which has two ports. The order in which the parameters are applied is in PCI Bus/Device/Function order. To determine the PCI BDF order on your system, use the following command:

# lshw -c network -businfo

Screenshot of lshw command showing PCI Bus, Device Function information for NICs

Based on this system configuration, using the modprobe command above, the Intel Ethernet Controller 10 Gigabit X540-AT2 port at PCI address 00:03.0 is allocated the FdirPballoc and AtrSampleRate parameters of 3 and 31, respectively, and the Intel Ethernet Controller 10 Gigabit 82599 port at PCI address 81:00.1 is allocated the FdirPballoc and AtrSampleRate parameters of 3 and 255, respectively.

Once you have determined that your Intel branded server network adapter supports Intel Ethernet FD and you have loaded the desired parameters into the driver (on supported models), execute the following command to enable Intel Ethernet FD:

# ethtool --features enp4s0f0 ntuple on

Screenshot of using ethtool command to turn Intel Flow Director on

Because the commands below only indicate which Rx queue a matched packet should be sent to, ideally an additional step should be taken to pin both Rx queues and the process, application, or container that is consuming the network traffic to the same CPU. Pinning an application/process/container to a CPU is beyond the scope of this document, but it can be done using the taskset command. Pinning IRQs to a CPU can be done using the set_irq_affinity script that is included with the freely available sources of the i40e and ixgbe drivers. See Intel Support: Drivers and Software for the latest versions of these drivers. See also the IRQ Affinity section in this tuning guide for how to set IRQ affinity.

Using Intel Ethernet Flow Director

Intel Ethernet FD can run in one of two modes: externally programmed (EP) mode, and ATR mode. Once Intel Ethernet FD is enabled as shown above, ATR mode is the default mode, provided that the driver is in multiple Tx queue mode. When running in EP mode, the user or management/orchestration software can manually set how flows are handled. In either mode, fields are intelligently selected from the packets in the Rx queues to index into the Perfect-Match filter table. For more information on how Intel Ethernet FD works, see this whitepaper.

Application Targeting Routing

In ATR mode, Intel Ethernet FD uses fields from the outgoing packets in the Tx queues to populate the 8K-entry Perfect-Match filter table. The fields that are selected depend on the packet type; for example, fields to filter TCP traffic will be different than those used to filter user datagram protocol (UDP) traffic. Intel Ethernet FD then uses the Perfect-Match filter table to intelligently route incoming traffic to the Rx queues.

To disable ATR mode and switch to EP mode, simply use the ethtool command shown under Adding Filters to manually add a filter, and the driver will automatically enter EP mode. To automatically re-enable ATR mode, use the ethtool command under Removing Filters until the Perfect-Match filter table is empty.

Externally Programmed Mode

When Intel Ethernet FD runs in EP mode, flows are manually entered by an administrator or by management/orchestration software (for example, OpenFlow*). As mentioned above, once enabled, Intel Ethernet FD automatically enters EP mode when a flow is manually entered using the ethtool command listed under Adding Filters.

Adding Filters

The following commands illustrate how to add flows/filters to Intel Ethernet FD using the -U, -N, or --config-ntuple switch to ethtool.

To specify that all traffic from 10.23.4.6 to 10.23.4.18 be placed in queue 4, issue this command:

# ethtool --config-ntuple <interface name> flow-type tcp4 src-ip 10.23.4.6 dst-ip 10.23.4.18 action 4

Note: Without the ‘loc’ parameter, the rule is placed at position 1 of the Perfect-Match filter table. If a rule is already in that position, it is overwritten.

Forwards to queue 2 all IPv4 TCP traffic from 192.168.10.1:2000 that is going to 192.168.10.2:2001, placing the filter at position 33 of the Perfect-Match filter table (and overwriting any rule currently in that position):

# ethtool --config-ntuple <interface name> flow-type tcp4 src-ip 192.168.10.1 dst-ip 192.168.10.2 src-port 2000 dst-port 2001 action 2 loc 33

Drops all UDP packets from 10.4.82.2:

# ethtool --config-ntuple <interface name> flow-type udp4 src-ip 10.4.82.2 action -1

Note: The VLAN field is not a supported filter with the i40e driver (Intel Ethernet Controller XL710 and Intel Ethernet Controller X710 NICs).

For more information and options, see the ethtool man page documentation on the -U, -N, or --config-ntuple option.

Note: The Intel Ethernet Controller XL710 and the Intel Ethernet Controller X710, of the Intel® Ethernet Adapter family, provide extended cloud filter flow support for more complex cloud networks. For more information on this feature, please see the Cloud Filter Support section in this ReadMe document, or in the ReadMe document in the root folder of the i40e driver sources.

Removing Filters

In EP mode, to remove a filter from the Perfect-Match filter table, execute the following command against the appropriate interface. ‘N’ in the rule below is the numeric location in the table that contains the rule you want to delete:

# ethtool --config-ntuple <interface name> delete N

Listing Filters

To list the filters that have been manually entered in EP mode, execute the following command against the desired interface:

# ethtool --show-ntuple <interface name>

Disabling Intel Ethernet Flow Director

Disabling Intel Ethernet FD is done with this command:

# ethtool --features enp4s0f0 ntuple off

This flushes all entries from the Perfect-Match filter table.

Conclusion

Intel Ethernet FD directs Ethernet packets to the core where the packet consuming process, application, container, or microservice is running. This functionality is a step beyond RSS, in which packets are simply sent to different cores for interrupt processing, and then subsequently forwarded to cores on which the consuming process is running. It can be explicitly programmed by administrators and control plane management software, or it can intelligently sample outgoing traffic and automatically create Perfect-Match filters for incoming packets. When operating in automatic ATR mode, Intel Ethernet FD is essentially the hardware offloaded version of Receive Flow Steering available on Linux systems.

Intel Ethernet FD can provide additional performance benefit, particularly in workloads where packets are small and traffic is heavy (for example, in Telco environments). And because it can be used to filter and drop packets at the network interface card (NIC), it could be used to avert denial-of-service attacks.

 

Resources

https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/82599-10-gbe-controller-datasheet.pdf

https://downloadmirror.intel.com/26556/eng/README.txt

https://downloadmirror.intel.com/26713/eng/Readme.txt

https://downloadmirror.intel.com/22919/eng/README.txt

http://dpdk.org/doc/guides/howto/flow_bifurcation.html

http://www.intel.com/content/dam/www/public/us/en/documents/technology-briefs/xl710-sr-iov-config-guide-gbe-linux-brief.pdf

http://software.intel.com/en-us/videos/creating-virtual-functions-using-sr-iov

Also, view the ReadMe file found in the root directory of both the i40e and ixgbe driver sources.

Notices

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.

This sample source code is released under the Intel Sample Source Code License Agreement.

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

Microsoft, Windows, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries.

*Other names and brands may be claimed as the property of others.

© 2017 Intel Corporation

Cannot Connect to Intel® Flexlm* License Server Due to Firewall


Problem

License check-out on the client system fails with the following error:

 INTEL: Cannot connect to license server system. (-15,570:115 "Operation now in progress")

The user could telnet to the server on port 28518, and port 28518 was open in the firewall.


Root Cause

When the Intel(R) Flexlm license server starts, there are two server daemons running:

One is the FlexNet Publisher* license server daemon, which uses the default port 28518 or the port set on the SERVER line of the license file.

The other is the Intel(R) Software License Manager vendor daemon, which uses the TCP/IP port number specified on the VENDOR line of the license file. Normally this number is omitted. You can find the actual number in the server log. Depending on your operating system, the server log file is located at <install drive>:\program files\common files\intel\flexlm\iflexlmlog.txt on Windows*, or <install location of servers>/lmgrd.log on Linux* or OS X*. You may find lines like the following in the log file:

... (INTEL) (@INTEL-SLOG@) === Network Info ===
... (INTEL) (@INTEL-SLOG@) Socket interface: IPV4
... (INTEL) (@INTEL-SLOG@) Listening port: 49163
... (INTEL) (@INTEL-SLOG@) Daemon select timeout (in seconds):

In this example, listening port 49163 is the TCP/IP port the Intel vendor daemon is using.

Because this second listening port (49163) was not open in the firewall, clients could not connect to the vendor daemon and received the error above during license check-out.


Solution

To connect to the license server with the firewall enabled, you must add exceptions to open the listening ports of both the FlexNet Publisher* license server daemon and the Intel(R) Software License Manager vendor daemon. In this case, opening port 49163 in addition to 28518 resolved the problem.

Intel’s FLEXlm specifies two ports:

1. SERVER host_name host_id port1 -- This port is specified in the product license (typically 28518).

2. VENDOR INTEL port=port2 -- Usually port1 is set to 28518 and port2 is omitted (the system then chooses one randomly).

However, you may set port2 to a fixed value and open that port in the firewall as well.
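
For example (the host name, host ID, and port values below are placeholders; use the values from your own license file), the two lines might look like this, with the vendor daemon pinned to port 49163:

SERVER my_license_server 001122334455 28518
VENDOR INTEL port=49163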


Pixeldash Studios Diversifies to Succeed.


The original article is published by Intel Game Dev on VentureBeat*: Pixeldash Studios Diversifies to Succeed. Get more game dev news and related topics from Intel on VentureBeat.

Screenshot of fast motorcycle running into a car crash, in a winter snowy pass

The state of Louisiana—the Pelican State, apparently—was not really considered a hotbed of game development when Jason Tate and Evan Smith were working for the only studio in Baton Rouge. When that company folded, it became a crossroads moment: move to the Bay Area or any other gaming hotbed, or stay where they were and make the best of the situation in their home state.

“We opted to start our own company so we could make games, and stay in Louisiana,” says Tate, the co-founder and lead programmer, of the decision he made with Smith, who would act as Creative Director on the two-man team.

This decision in 2011 was assisted by the state government’s efforts to build the games industry, which had even seen representatives visit E3 to alert indie companies to the tax programs and other opportunities being offered to attract teams. “We’re part of an incubator called the Louisiana Technology Park where we now have a very cool facility, it’s very affordable, and six years later there are seven indie companies under one roof,” says Tate.

What makes the Pixel Dash story a little different is that despite just being two people, they worked on client projects to supplement funds raised for their game, Road Redemption, through Kickstarter, the Humble Store, and eventually Steam* Early Access. “In the beginning, it was basically two people doing the job of 10 people,” says Smith, “but it was about building something in the community that would have some longevity, so we didn’t do just one project and bet everything on that.”

Having seen and been a part of a team that had bet on one horse and ultimately fizzled, this team was determined to ensure it had legs. “It’s been an interesting journey over five years, bouncing around projects like corporate client work, e-learning, and taking our gaming skills over to training simulations,” says Smith.

Yes, Pixel Dash has crafted apps and other software for diverse topics like student preparedness at Louisiana State University (creating an app that gamified job searches, including elements where you follow a create-your-own-adventure path and could learn at the end where you did well or went wrong. The tool would even advise on outfits so that students would be prepared for the ‘business casual’ world.)

While for a small team this has meant that work on its prime gaming project may have taken some time, it has kept the lights on and the process running. It hasn’t been a one-way street of gamifying corporate apps or programs, however.

“One of the benefits of working with clients is the game design expertise can travel over to the corporate world, but you can learn things from that side, too, that you can then channel back to your game side,” says Tate.

The purpose of all this is to make sure that Road Redemption sees a full release. For an indie game to have started development back in 2011 and hitting Kickstarter* in 2013, that’s a long cycle. But due to this diversified client list, the game will see full release, and that’s the main goal of every developer. “The combination of Kickstarter and Early Access has been the reason we could fund a project of this size. All the funds flow straight back into development,” says Smith.

Screenshot of fast speed motorcycle rooftop chase with guns
Above: Bikes, speed, guns, and rooftop courses give Road Redemption real visual style.

Spirit of Road Rash

It may not surprise you to learn that a game titled Road Redemption is designed in the spirit of the classic road beat-em-up, Road Rash, and inspired by games like Twisted Metal and even Skitchin’, released on the Sega Genesis. Judging from community feedback and continued engagement with the game through the fundraising campaigns, there is a passionate audience for this kind of road combat game.

“We wanted to add new stuff like projectile weapons and guns, and we were surprised how polarizing that was,” says Smith. “Some of the purists said ‘there are no guns in Road Rash, there should be none here’.”

“There’s a camp of people very vocal about staying true to the original and another camp excited about something different,” adds Tate.

Community engagement and interacting with YouTube* influencers and streamers has also kept Road Redemption in the minds of gamers. “It fits well for YouTubers because a lot of ridiculous things happen, so we’ve had PewDiePie cover us twice, and others. We’ve certainly been trying to build relationships…and have over 50 million combined YouTube video views,” says Smith.

It’s a vital marketing outlet for a team without the manpower or budget to follow more traditional advertising methods. Instead, they rely on the forums where gamers discuss their tactics and share reviews as simple as “that was awesome…Santa Claus smacking someone in the face with a shovel.”

Of course, a couple of handy tactics involved appealing to the egos of some prominent influencers. “One thing we did early on was put a lot of the YouTube personalities names in the game, so they were encouraged to go find themselves in the game,” says Smith. Crafty!

Screenshot of high speed motorcycle chase, at night. Biker with a pumpkin on his head attacks other with a baseball bat
Above: Beating rivals with a baseball bat from the back of a bike…don’t try this at home, kids.

“As internet comments go there’s always a lot of hate, so when you see that one person who seems enamored, it really pushes you forward to do well,” says Smith. Following this format also allowed them to balance certain aspects using community feedback and consider requested modes. “One mode was to have to beat the game with just one health. We think that’s going to take ages, and then within a few days on the forums, people are talking about having beaten it,” says Tate. Yeah, they do that, but with the time to make tweaks and follow the myriad comments, the result should be a game that has the longevity the company hopes to achieve.

It’s quite an achievement that may have fallen at the first hurdle if not for the combination of support from the state, the ease of entry through digital distribution, and emerging funding opportunities.

That’s something important for this team in the state of Louisiana. Pixel Dash has taken on interns in paid and course credit programs from LSU, and many of them have turned into full-time employees.

It’s a far cry from the opportunities Smith saw several years ago. “Growing up in Louisiana, it’s not something we thought would be here for us. Personally, it’s like ‘wow, this is a thing now’. We can develop games here…we have dev kits and are working for console, and it would never have happened years ago.”

It’s happening now.

How Embree Delivers Uncompromising Photorealism


Introduction

Rendering is the process of generating final output from a collection of data that defines the geometry of objects, the materials they are made from, and the light sources in a 3D scene.

Rendering is a computationally demanding task, involving calculations for many millions of rays as they travel through the scene and interact with the materials on every surface. With full global illumination (GI), light bounces from surface to surface, changing intensity and color as it goes, and it may also be reflected and refracted, and even absorbed or scattered by the volume of an object rather than just interacting with its surface.

3D rendering of an expensive car

Rendering is used in industrial applications as diverse as architectural and product visualization, animations, visual effects in movies, automotive rendering, and more.

For all these applications, the content creator may be anyone from a lone freelancer to an employee of a multi-million-dollar company with hundreds of employees. The hardware used can be anything from a single machine to many hundreds of machines, any of which may be brand new or a decade old.

No matter the size of the company or what hardware they use, 3D artists are expected to create photoreal output within tight and demanding deadlines.

Corona Renderer*, developed by Render Legion a.s., is a leading rendering solution that meets all these diverse needs.

Logo for Corona

The Challenges

3D rendering of an opulent room

End users require absolute realism in the final output, but they also need unrivaled speed and stability due to the strict deadlines involved in their work.

Handling the complex calculations needed to generate the final output is only half the battle. A modern render engine must also increase productivity through ease of use; in other words, the software must let users work faster as well as render faster. As a result, end users need real-time previews to assess changes in lighting and materials, adjustments in the point of view, or changes to the objects themselves.

Each industry also has its own specific needs, and every user has a set of preferred tools. A render engine must be compatible with as wide a range of tools and plug-ins as possible.

Development costs for the render engine have to be managed, so that the final product is affordable for both single users and large firms.

The render engine must also work on as wide a variety of hardware as possible and scale across setups, from single laptops to render farms consisting of hundreds of multiprocessor machines.

The Solution

3d rendering of a large modern airplane on a sunny runway

Corona Renderer uses Embree ray tracing kernels to carry out the intensive computations necessary in rendering. This ensures that end users get the best speed and performance and the highest level of realism in their final output.

Using Embree ray tracing kernels gives another benefit: the development team is freed from having to optimize these calculations themselves, which means they can use their time and talents to meet other demands such as:

  • Creating a simple and intuitive user interface
  • Ensuring compatibility with a wide range of tools and plug-ins
  • Meeting the specialized needs for each industry
  • Creating code that allows seamless scaling on setups of a single machine through to hundreds of machines
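
For readers curious about what the Embree kernels look like at the code level, the following is a minimal sketch, not Corona Renderer's actual integration: it builds a one-triangle scene and traces a single ray using the Embree 3 C API. Corona Renderer at the time of this article used an earlier Embree release, whose function names differ slightly, and the geometry values and single-ray usage here are illustrative assumptions only.

#include <embree3/rtcore.h>
#include <cstdio>
#include <limits>

int main()
{
        // Create the Embree device and an empty scene.
        RTCDevice device = rtcNewDevice(nullptr);
        RTCScene  scene  = rtcNewScene(device);

        // One triangle: three vertices and one index triple.
        RTCGeometry geom = rtcNewGeometry(device, RTC_GEOMETRY_TYPE_TRIANGLE);
        float *verts = (float *) rtcSetNewGeometryBuffer(
                geom, RTC_BUFFER_TYPE_VERTEX, 0, RTC_FORMAT_FLOAT3,
                3 * sizeof(float), 3);
        unsigned *idx = (unsigned *) rtcSetNewGeometryBuffer(
                geom, RTC_BUFFER_TYPE_INDEX, 0, RTC_FORMAT_UINT3,
                3 * sizeof(unsigned), 1);
        verts[0] = 0.f; verts[1] = 0.f; verts[2] = 0.f;
        verts[3] = 1.f; verts[4] = 0.f; verts[5] = 0.f;
        verts[6] = 0.f; verts[7] = 1.f; verts[8] = 0.f;
        idx[0] = 0; idx[1] = 1; idx[2] = 2;
        rtcCommitGeometry(geom);
        rtcAttachGeometry(scene, geom);
        rtcReleaseGeometry(geom);
        rtcCommitScene(scene);          // builds the acceleration structure

        // Trace one ray from z = -1 straight toward the triangle.
        RTCRayHit rayhit = {};
        rayhit.ray.org_x = 0.2f; rayhit.ray.org_y = 0.2f; rayhit.ray.org_z = -1.f;
        rayhit.ray.dir_x = 0.f;  rayhit.ray.dir_y = 0.f;  rayhit.ray.dir_z = 1.f;
        rayhit.ray.tnear = 0.f;
        rayhit.ray.tfar  = std::numeric_limits<float>::infinity();
        rayhit.ray.mask  = 0xFFFFFFFF;
        rayhit.hit.geomID = RTC_INVALID_GEOMETRY_ID;

        RTCIntersectContext context;
        rtcInitIntersectContext(&context);
        rtcIntersect1(scene, &context, &rayhit);

        if (rayhit.hit.geomID != RTC_INVALID_GEOMETRY_ID)
                printf("hit at t = %f\n", rayhit.ray.tfar);

        rtcReleaseScene(scene);
        rtcReleaseDevice(device);
        return 0;
}

A production renderer traces millions of such rays, typically in packets or streams, but the division of labor is the same: the application describes geometry and rays, and the Embree kernels handle the acceleration structure and intersection work.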

The developers of Corona Renderer use Intel processor-based machines for coding and testing, which helps ensure that the development and testing environments are similar to the hardware used by most end users (90 percent of them work and render on Intel processor-based machines), and gives the developers the most stable, reliable, and best-performing environment for their work.

Close collaboration with the Embree team at Intel means that the Corona Renderer developers get the best results from Intel technology, a benefit that is passed directly on to Corona Renderer end users.

Results – With Embree versus without Embree

Application: Path tracer renderer using Embree ray tracing kernels

Description: Path tracer renderer integrated into various 3D modeling software packages

Highlights: Corona Renderer is a production-quality path tracer renderer, used in industrial applications such as architectural and product visualization, animation for film and television, automotive rendering, and more. It uses Embree ray tracing kernels to accelerate geometry processing and rendering.

End-user benefits: Speed, reliability, and ease of use in creating production-quality images and animations. Interactive rendering provides the same real-time viewing capabilities while working with a scene as those featured in GPU-based renderers, but with none of the drawbacks and limitations of a GPU-based solution.

Comparisons: The same scene was rendered with and without Embree, with results averaged from five machines with different processors (from the fastest rendering times on a dual Intel® Xeon® processor E5-2690 @ 2.9 GHz to the slowest rendering times on an Intel® Core™ i7-3930K processor @ 4.2 GHz).

Charts: rendering results for the same scene with and without Embree.

Results – Comparing Intel® Xeon® Processor Generations

3d rendering of a tank

Application: Path tracer renderer benchmarking application, using Embree ray tracing kernels

Description: Path tracing app for use in benchmarking hardware performance

Highlights: Corona Renderer is a production-quality path tracer renderer, used in industrial applications such as architectural and product visualization, animation for film and television, automotive rendering, and more. It uses Embree ray tracing kernels to accelerate geometry processing and rendering.

End-user benefits: Ability to compare performance of different computer hardware configurations, in particular different brands and generations of CPUs.

The application allows end users to test their own system, while an online database of results allows the user to look up the performance of other configurations.

Comparisons:

graphics comparison
Each generation of Intel® Xeon® processors offers significant improvement in rendering performance, with the Intel® Xeon® processor E5 v4 family processing roughly twice as many rays per second as the Intel® Xeon® processor E5 v2 family.

graphics comparison
Each generation of Intel® Xeon® processors offers significant improvement in rendering time, with the Intel® Xeon® processor E5 v4 family processing being roughly twice as fast as the Intel® Xeon® processor E5 v2 family.

The GPU Question - Speed

There is an ongoing debate about whether GPUs or CPUs provide the best solution for rendering.

The many thousands of cores that a GPU-based rendering solution offers may sound like an advantage, but at best this is only true with relatively simple scenes. As scene complexity increases, the more sophisticated architecture of CPUs takes the lead, providing the best rendering performance.

Corona Renderer uses full GI for true realism—something that is often disabled in GPU renderers for previews and even for final renders. This lack of full GI is behind some of the claims of the speed of GPU renderers. While you can disable this true calculation of light bounces throughout a scene with CPU-based solutions, you don’t really need to, since CPUs don’t struggle with these calculations in the same way GPU solutions do.

3d rendering of a bedroom with textures

GPUs gain their benefits when each of their thousands of cores is performing a similar type of calculation and when “what needs to be calculated next” is well known. However, when handling the millions of rays bouncing through a 3D scene, each core may have to do a very different calculation, and there will be many logic branches that will need to be accounted for. This is where the sophisticated architecture of a CPU pulls ahead, thanks to the more flexible and adaptive scheduling of processes across its cores.

GPU Speed Comparison

Test Setup

An interior scene was created, illuminated only by environment lighting entering the scene through small windows. This is a particularly challenging situation, as most of the lighting in the scene is indirect, coming from light bouncing throughout the scene.

To standardize across the different renderers, only their default materials were used, and the environment lighting was a single color. The render engines were left at their default settings as much as possible, although where relevant the GPU render engines were changed from defaults to use full GI.

The Corona Renderer was set to run for 2 minutes and 30 seconds, which included scene parsing and render time. Since the cost of the single NVIDIA GTX* 1080 card in the test setup is roughly half the cost of the Intel® Core™ i7-6900K processor, the GPU engines were set to run for 5 minutes, approximating the effect of having two GTX 1080 cards to give a comparable measure of performance-per-cost.

Hardware

The same PC was used for each test:
Corona Renderer: Intel Core i7-6900K processor, 3.2 GHz base, 3.7 GHz turbo
GPU engines: NVIDIA GTX 1080, 1.6 GHz base, 1.7 GHz turbo

Results

3d rendering of a bedroom without textures

3d rendering of a bedroom without textures

3d rendering of a bedroom without textures

Despite running for twice as long, both GPU engines showed significant noise in the results and did not approach the results shown by the CPU-based Corona Renderer, which had very little noise remaining in half the time.

By using Corona Renderer’s denoising feature, the same 2 minutes and 30 seconds (the last 5 seconds used for denoising rather than rendering) gives an image that is almost completely free of noise.

3d rendering of a bedroom without textures

Speed Conclusion

At best, the speed benefits of a GPU-based solution only apply to simpler scenes. As the path that the lighting follows increases in complexity, the more sophisticated CPU architecture takes a clear lead.

Many of the claims of GPU rendering speed are based on artificial simplifications of a scene, such as disabling full light bouncing, clamping highlights, and so on. CPU-based solutions can also implement these simplifications, but performance is so good that they are not required.

Other CPU versus GPU Considerations

Stability

CPU-based solutions lead the way in terms of stability and reliability, factors that can be critical in many industries, because they do not rely on the stability of frequently updated graphics card drivers.

Compatibility

Most 3D software has a wide range of plug-ins or shaders available that expand on the inbuilt functionality or materials. CPU rendering solutions offer the widest compatibility with these plug-ins, some of which are integral to the workflow of certain industries.

Also, many companies, and even freelancers, turn to commercial render farms to deliver content within the tight deadlines set by their clients. Render farms use many hundreds of machines to accelerate rendering. While there are many long-established render farms supporting CPU-based solutions like Corona Renderer, far fewer farms exist that support GPU-based renderers.

3D rendering of an elegant courtyard

Interactive Rendering

The ability to see changes in your scene without having to start and stop a render has become critical to the workflow of many artists. This kind of real-time rendering is not unique to GPU-based solutions, however. Since its release, Corona Renderer has included Interactive Rendering that provides exactly this functionality, allowing a user to move an object, change the lighting, alter a material, move a camera, and so on, and see the results of that change immediately.

The result shown in the Interactive Renderer is identical to the final render, including full GI and any post-processing, and can be displayed in a separate window (the Corona VFB) or even in the viewport of the host 3D application as shown below:

3d rendering of a terrace and its 3D environment

Hardware - Networking

Even for freelancers, it is common practice to have a network of machines to use as a local render farm.

Building a multiple-GPU solution can take special hardware and knowledge, and many of the claims of the high performance of GPU-based solutions come from users who have specialized setups that support four or more graphics cards.

With a CPU-based solution, anyone can create a similar network without needing any specialized knowledge or hardware. Any computer from the last decade can be added to the rendering network, thanks to Corona Renderer’s inbuilt Distributed Rendering. If the machines are on the same network, you can use auto-discovery to add them without any manual setup at all, while those on a different network can be added by simply adding their IP addresses to a list. In both cases, those machines can then be used to assist in accelerating rendering beyond what a single machine can do.

This ability is also reflected in the availability of render farms and cloud-based rendering services, which allow users to submit their renders to many hundreds of machines for even faster processing. For CPU-based render engines, users can choose from many farms, while only a handful of farms offer similar services for GPU-based render engines.

This means that CPU-based renderers make it easy to create a farm of rendering machines, even for freelancers or hobby-level users, and of course each machine can still be used as a computer in its own right, while networked GPUs are only useful during rendering.

Hardware - Upgrading

When upgrading a CPU, the benefits are realized across all applications. Upgrading a GPU, on the other hand, only offers benefits for a few select applications and uses. Money invested in GPU hardware almost exclusively benefits your render times, while money invested in an upgraded CPU will benefit every aspect of your workflow.

Hardware - Memory

The maximum RAM directly available on a graphics card is limited by current technology, with the most expensive cards at the time of writing (in the region of USD 2,500) offering a maximum of 32 GB of memory. CPU-based solutions, on the other hand, can easily and affordably have two, three, or four times as much directly accessible RAM as the “latest and greatest” graphics cards, at a fraction of the cost.

While GPU render engines may be able to access out-of-core memory (that is, memory not on the graphics card itself), this capability often results in a reduction of rendering performance. With a CPU-based solution like Corona Renderer, there is much less need to worry about optimizing a scene to reduce polygon counts, textures, and so on.

Also, upgrading a GPU is all-or-nothing: if more memory on the graphics card itself is required, to avoid using out-of-core memory for example, you must replace the entire card. With a CPU-based solution, adding more memory is simple and again benefits every application, not just rendering.

3D rendering of inner gears of a timepiece

GPU versus CPU Summary

Corona Renderer remains CPU-based, ensuring the widest compatibility, greatest stability, easiest hardware setup, and best performance as scenes scale in complexity. Ninety percent of Corona Renderer end users choose Intel processor-based solutions.

Thanks to the power of modern CPUs and the Embree ray tracing kernels, Corona Renderer can offer all the benefits claimed by GPU-based render engines, but with none of the drawbacks.

Conclusion

Corona Renderer continues to grow in market share thanks to its combination of rendering speed and simple user interface that allows users to take an artistic rather than technical approach to their work without any loss of speed, power, or performance.

By ensuring the best possible rendering performance, the Embree ray tracing kernels let the Corona Renderer team look beyond optimizing the rendering process and focus instead on ease of use and continued innovation, introducing unique features such as LightMix, Interactive Rendering, the standalone Corona Image Editor, inbuilt scattering tools, and more.

Despite recent developments in the field of GPUs, CPU-based solutions remain the best, and sometimes only, solution for a great many companies, freelancers, and individuals.

3d rendering of an elegant car in a wet, rainy environment

Learn More

Corona Renderer home page:

https://corona-renderer.com/

Embree ray tracing kernels:

https://embree.github.io/

About Corona Renderer and Render Legion a.s.

Corona Renderer is a CPU-based rendering engine initially developed for 3ds Max*, and also available as a standalone command-line application. It is currently being ported to Cinema 4D*.

It began as a solo student project by Ondřej Karlík at the Czech Technical University in Prague, evolving into a full-time commercial project in 2014 after Ondřej established a company along with former CG artist Adam Hotový and Jaroslav Křivánek, an associate professor and researcher at Charles University in Prague.

It was first released commercially in February 2015, and since then Render Legion has grown to more than 15 members.

Corona Renderer’s focus has been mainly on polishing the ArchViz feature set, and future plans are for further specialization to meet the needs of the automotive, product design, and VFX industries.

What's New? - Intel® VTune™ Amplifier XE 2017 Update 4


Intel® VTune™ Amplifier XE 2017 performance profiler

A performance profiler for serial and parallel performance analysis. Overview | Training | Support.

New for the 2017 Update 4! (Optional update unless you need...)

As compared to 2017 Update 3:

  • General Exploration, Memory Access, HPC Performance Characterization analysis types extended to support Intel® Xeon® Processor Scalable family
  • Support for Microsoft Windows* 10 Creators Update (RS2) 

Resources

  • Learn (“How to” videos, technical articles, documentation, …)
  • Support (forum, knowledgebase articles, how to contact Intel® Premier Support)
  • Release Notes (pre-requisites, software compatibility, installation instructions, and known issues)

Contents

File: vtune_amplifier_xe_2017_update4.tar.gz

Installer for Intel® VTune™ Amplifier XE 2017 for Linux* Update 4

File: VTune_Amplifier_XE_2017_update4_setup.exe

Installer for Intel® VTune™ Amplifier XE 2017 for Windows* Update 4 

File: vtune_amplifier_xe_2017_update4.dmg

Installer for Intel® VTune™ Amplifier XE 2017 - OS X* host only Update 4 

* Other names and brands may be claimed as the property of others.

Microsoft, Windows, Visual Studio, Visual C++, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries.

How Yahoo! JAPAN Used Open vSwitch* with DPDK to Accelerate L7 Performance in Large-Scale Deployment Case Study


View PDF [783 KB]

As cloud architects and developers know, it can be incredibly challenging to keep up with the rapidly increasing cloud infrastructure demands of both users and services. Many cloud providers are looking for proven and effective ways to improve network performance. This case study discusses one such collaborative project undertaken between Yahoo! JAPAN and Intel in which Yahoo! JAPAN implemented Open vSwitch* (OvS) with Data Plane Development Kit (OvS with DPDK) to deliver up to 2x practical cloud application L7 performance improvement while successfully completing a more than 500-node, large-scale deployment.

Introduction to Yahoo! JAPAN

Yahoo! JAPAN is a Japanese Internet company that was originally formed as a joint venture between Yahoo! Inc. and Softbank. The Yahoo! JAPAN portal is one of the most frequently visited websites in Japan, and its many services have been running on OpenStack* Private Cloud since 2012. Yahoo! JAPAN receives over 69 billion monthly page views, of which more than 39 billion come from smartphones alone. Yahoo! JAPAN also has over 380 million total app downloads, and it currently runs more than 100 services.

Network Performance Challenges

As a result of rapid cloud expansion, Yahoo! JAPAN began observing network bottlenecks in its environment in 2015. At that time, both cloud resources and users were doubling year over year, causing a rapid increase in virtual machine (VM) density. Yahoo! JAPAN was also seeing huge spikes and bursts in network traffic whenever breaking news, weather updates, or public service announcements (following an earthquake, for example) occurred. This dynamic put an additional burden on the network environment.

As these network performance challenges arose, Yahoo! JAPAN began experiencing some difficulties meeting service-level agreements (SLAs) for its many services. Engineers from the network infrastructure team at Yahoo! JAPAN noticed that noisy VMs (also known as “noisy neighbors”) were disrupting the network environment.

When that phenomenon occurs, a rogue VM may monopolize bandwidth, disk I/O, CPU, and other resources, which then impacts other VMs and applications in the environment.

Yahoo! JAPAN also noticed that the compute nodes were processing a large volume of short packets and that the network was handling a very heavy load (see Figure 1). Consequently, decreased network performance was affecting the SLAs.

Figure 1. A compute node showing a potential network bottleneck in a virtual switch.

Yahoo! JAPAN determined that its cloud infrastructure required a higher level of network performance in order to meet its application requirements and SLAs. In the course of its research, Yahoo! JAPAN noticed that the Linux* Bridge overrun counter was increasing, which indicated that the cause of its network difficulties lay in the kernel networking path. As a result, the company decided it needed to find a new solution going forward.

About OvS with DPDK

OvS with DPDK is a potential solution to such network performance issues in cloud environments that already use OpenStack Cloud, since OpenStack uses OvS as its virtual switch. Native OvS uses kernel space for packet forwarding, which imposes a performance overhead and can limit network performance. DPDK, however, accelerates packet forwarding by bypassing the kernel.

DPDK integration with OvS offers other beneficial performance enhancements as well. For example, DPDK’s Poll Mode Driver eliminates context switch overhead. DPDK also uses direct user memory access to and from the NIC to eliminate kernel-user memory copy overhead. Both optimizations can greatly boost network performance. Overall, DPDK maintains compatibility with OvS while accelerating packet forwarding performance. Refer to Intel Developer Zone’s article, Open vSwitch with DPDK Overview, for more information.
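
To illustrate the poll-mode model described above, here is a minimal sketch of a DPDK receive loop (an illustration only, not Yahoo! JAPAN's deployment code): after EAL initialization, a core busy-polls the NIC with rte_eth_rx_burst() instead of waiting for interrupts, so packet data stays in user-space hugepage memory and no kernel context switches are involved. Port and queue setup is abbreviated; the use of port 0, queue 0, and an already-configured device are assumptions.

#include <stdint.h>
#include <stdlib.h>
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

int main(int argc, char **argv)
{
        /* Initialize the Environment Abstraction Layer (hugepages, cores, devices). */
        if (rte_eal_init(argc, argv) < 0)
                return EXIT_FAILURE;

        /*
         * Assumption: port 0 has already been configured and started with
         * rte_eth_dev_configure(), rte_eth_rx_queue_setup(), and
         * rte_eth_dev_start(), using a pool from rte_pktmbuf_pool_create().
         */
        const uint16_t port_id = 0;
        struct rte_mbuf *pkts[BURST_SIZE];

        for (;;) {
                /* Poll-mode receive: no interrupts, no kernel involvement. */
                const uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, pkts, BURST_SIZE);

                for (uint16_t i = 0; i < nb_rx; i++) {
                        /* Packet data is already in user-space memory here. */
                        /* ... process or forward pkts[i] ... */
                        rte_pktmbuf_free(pkts[i]);
                }
        }
        return 0;
}

In OvS with DPDK, dedicated PMD threads run this kind of polling loop continuously, which is why CPU affinity and hugepage configuration (discussed below) matter so much for stable performance.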

Collaboration between Intel and Yahoo! JAPAN

As Yahoo! JAPAN was encountering network performance issues, Intel suggested that the company consider OvS with DPDK since it was now possible to use the two technologies in combination with one another. Yahoo! JAPAN was already aware that DPDK offered network performance benefits for a variety of telecommunications use cases but, being a web-based company, the company thought that it would not be able to take advantage of that particular solution. After discussing the project with Intel and learning about ways in which the technologies could work for a cloud service provider, Yahoo! JAPAN decided to try OvS with DPDK in their OpenStack environment.

For optimal performance deployment in OvS with DPDK, Yahoo! JAPAN enabled 1 GB hugepages. This step was important from a performance perspective, because it enabled Yahoo! JAPAN to reduce Translation Lookaside Buffer (TLB) misses and prevent page faults. The company also paid special attention to its CPU affinity design, carefully identifying ideal resource settings for each function. Without that step, Yahoo! JAPAN would not have been able to ensure stable network performance.

OpenStack’s Mitaka release offered the features required for Yahoo! JAPAN’s OvS with DPDK implementation, so the company decided to build a Mitaka cluster running with the configurations mentioned above. The first cluster includes over 150 nodes and uses Open Compute Project (OCP) servers.

Benchmark Test Results

Yahoo! JAPAN achieved impressive performance results after implementing OvS with DPDK in its cloud environment. To demonstrate these gains, the engineers measured two benchmarks: the network layer (L2) and the application layer (L7).

Table 1. Benchmark test configuration.

Hardware
  • CPU: Intel® Xeon® processor E5-2683 v3, 2S
  • Memory: 512 GB DDR4-2400 RDIMM
  • NIC: Intel® Ethernet Converged Network Adapter X520-DA2

Software
  • Host OS: CentOS* 7.2
  • Guest OS: CentOS 7.2
  • OpenStack*: Mitaka
  • QEMU*: 2.6.2
  • Open vSwitch: 2.5.90 + TSO patch (a6be657)
  • Data Plane Development Kit: 16.04

Figure 2. L2 network benchmark test.

L2 Network Benchmark Test Results

In the L2 benchmark test, Yahoo! JAPAN used Ixia IxNetwork* as a packet generator. Upon measuring L2 performance (see Figure 2), Yahoo! JAPAN observed a 10x improvement in network throughput for its short-packet traffic. The company also found that OvS with DPDK reduced latency to as little as one-twentieth (1/20) of its previous level. With these results, Yahoo! JAPAN confirmed that OvS with DPDK accelerates the L2 path to the VM. The results were roughly in line with what Yahoo! JAPAN expected, as telecommunications companies had achieved similar results in their own benchmark tests.

L7 Network Benchmark Test Results

The L7 single-VM benchmark results for the application layer, however, exceeded Yahoo! JAPAN’s expectations. In this test, Yahoo! JAPAN instructed one VM to send a query and another VM to return a response. All applications (HTTP, MQ, DNS, RDB) demonstrated significant performance gains in this scenario (see Figure 3). In the MySQL* sysbench result in particular, Yahoo! JAPAN saw simultaneous improvement in two important metrics: 1.5x better throughput (transactions/sec) and latency (response time) reduced to roughly two-thirds (1/1.5) of its previous level.

Figure 3. Various application benchmark test results.

Application Benchmark Test Results

Why did network performance improve so dramatically? In the case of HTTP, for example, Yahoo! JAPAN saw a 2.0x improvement with OvS with DPDK compared to Linux Bridge. Yahoo! JAPAN determined that this metric improved because OvS with DPDK reduces the number of context switches by 45 percent compared with Linux Bridge.

The benchmark results for RabbitMQ* revealed another promising discovery. When Yahoo! JAPAN ran their first stress test on RabbitMQ under Linux Bridge, it observed degraded performance. When it ran the same stress test under OvS with DPDK, the application environment maintained a much more consistent and satisfactory level of performance (see Figure 4).

Figure 4. RabbitMQ stress test results.

RabbitMQ Stress Test Results

How was this possible? In both tests, noisy conditions created a high degree of context switching. Under Linux Bridge, roughly 50 percent of processing is paid as a “tax” to the kernel; under OvS with DPDK, that tax is only about 10 percent. Because OvS with DPDK suppresses context switching, network performance does not degrade even under challenging real-world conditions. Yahoo! JAPAN also found that CPU pinning reduces interference between multiple noisy neighbor VMs and the critical OvS process, which contributed further to the performance improvements observed in this test.

Ultimately, Yahoo! JAPAN found that OvS with DPDK delivers terrific network performance improvements for cloud environments. This finding was key to resolving Yahoo! JAPAN’s network performance issues and meeting the company’s SLA requirements.

Summary

Despite what you might think, deploying OvS with DPDK is actually not so difficult. Yahoo! JAPAN is already successfully using this technology in a production system with over 500 nodes. OvS with DPDK offers powerful performance benefits and provides a stable network environment, which enables Yahoo! JAPAN to meet its SLAs and easily support the demands placed on its cloud infrastructure. The impressive results that Yahoo! JAPAN has achieved through its implementation of OvS with DPDK can be enjoyed by other cloud service providers too.

When assessing whether OvS with DPDK will meet your requirements, it is important to carefully investigate what is causing the bottlenecks in your cloud environment. Once you fully understand the problem, you can identify which solution will best fit your specific needs.

To accomplish this task, Yahoo! JAPAN performed a thorough analysis of its network traffic before deciding how to proceed. The company learned that there was a high volume of short packets traveling throughout its network. This discovery indicated that OvS with DPDK might be a good solution for its problem, since OvS with DPDK is known to improve performance in network environments where a high volume of short packets is present. For this reason, Yahoo! JAPAN concluded that it is necessary to not only benchmark your results but also have a full understanding of your network’s characteristics in order to find the right solution.

Now that you’ve learned about the performance improvements that Yahoo! JAPAN achieved by implementing OvS with DPDK, have you considered deploying OvS with DPDK within your own cloud? To learn more about enabling OvS with DPDK on OpenStack, read these articles: Using Open vSwitch and DPDK with Neutron in DevStack, Using OpenSwitch with DPDK, and DPDK vHost User Ports.

Acknowledgment

Reflecting on this successful collaboration with Intel, Yusuke Tatsumi, network engineer for Yahoo! JAPAN’s infrastructure team, said: “We found out that the OvS and DPDK combination definitely improves application performance for cloud service providers. It strengthened our cloud architecture and made it more robust.” Yahoo! JAPAN is pleased to have demonstrated that OvS with DPDK is a valuable technology that can achieve impressive network performance results and meet the demanding daily traffic requirements of a leading Japanese Internet company.

About the Author

Rose de Fremery is a New York-based writer and technologist. She is the former Managing Editor of The Social Media Monthly, the world's first print magazine devoted to the social media revolution. Rose currently writes about a range of business IT topics including cloud infrastructure, VoIP, UC, CRM, business innovation, and teleworking.

Notices

Testing conducted on Yahoo! JAPAN systems by Yahoo! JAPAN.

Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer or learn more at intel.com.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.

Intel, the Intel logo, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

© 2017 Intel Corporation.

What is OpenCV?


OpenCV is a software toolkit for processing real-time images and video; it also provides analytics and machine learning capabilities.

Development Benefits

Using OpenCV, a BSD licensed library, developers can access many advanced computer vision algorithms used for image and video processing in 2D and 3D as part of their programs. The algorithms are otherwise only found in high-end image and video processing software.

Powerful Built-In Video Analytics

Video analytics is much simpler to implement with OpenCV APIs for basic building blocks such as background removal, filters, pattern matching, and classification.

Real-time video analytics capabilities include classifying, recognizing, and tracking objects, animals, and people, as well as specific features such as vehicle number plates, animal species, and facial features (faces, eyes, lips, chin, and so on).
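
As a hedged illustration of these capabilities (a minimal sketch, not drawn from any particular product; the camera index, window name, and Esc-to-quit handling are assumptions), the code below opens a camera and runs OpenCV's built-in HOG-based people detector on each frame:

#include <opencv2/opencv.hpp>
#include <vector>

int main()
{
        cv::VideoCapture cap(0);                 // default camera (assumption)
        if (!cap.isOpened())
                return 1;

        // Pre-trained HOG + linear SVM people detector shipped with OpenCV.
        cv::HOGDescriptor hog;
        hog.setSVMDetector(cv::HOGDescriptor::getDefaultPeopleDetector());

        cv::Mat frame;
        while (cap.read(frame)) {
                std::vector<cv::Rect> people;
                hog.detectMultiScale(frame, people);

                // Draw a rectangle around each detection.
                for (const cv::Rect &r : people)
                        cv::rectangle(frame, r, cv::Scalar(0, 255, 0), 2);

                cv::imshow("people", frame);
                if (cv::waitKey(1) == 27)        // Esc to quit
                        break;
        }
        return 0;
}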

Hardware and Software Requirements

OpenCV is written in Optimized C/C++, is cross-platform by design and works on a wide variety of hardware platforms, including Intel Atom® platform, Intel® Core™ processor family, and Intel® Xeon® processor family.

Developers can program OpenCV using C++, C, Python*, and Java* on Operating Systems such as Windows*, many Linux* distros, Mac OS*, iOS* and Android*.

Although some cameras work better than others due to driver quality, OpenCV can use any camera that has a working driver for the operating system in use.

Hardware Optimizations

OpenCV takes advantage of multi-core processing and OpenCL™, so it can also use hardware acceleration when integrated graphics is present.
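
A minimal sketch of what this looks like in practice (the input and output file names are placeholders): reading data into cv::UMat instead of cv::Mat opts a pipeline into OpenCV's transparent API (T-API), which can dispatch supported operations to an OpenCL device such as integrated graphics when one is available and falls back to the CPU otherwise.

#include <opencv2/opencv.hpp>
#include <opencv2/core/ocl.hpp>
#include <iostream>

int main()
{
        // Report whether an OpenCL device is available to OpenCV.
        std::cout << "OpenCL available: "
                  << (cv::ocl::haveOpenCL() ? "yes" : "no") << std::endl;

        // Copying into a UMat opts this pipeline into the transparent API.
        cv::UMat src, gray, blurred, edges;
        cv::imread("input.jpg", cv::IMREAD_COLOR).copyTo(src);

        // These calls may run on the GPU via OpenCL if present, or on the CPU.
        cv::cvtColor(src, gray, cv::COLOR_BGR2GRAY);
        cv::GaussianBlur(gray, blurred, cv::Size(5, 5), 1.5);
        cv::Canny(blurred, edges, 50, 150);

        cv::imwrite("edges.png", edges);
        return 0;
}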

OpenCV v3.2.0 release can use Intel optimized LAPACK/BLAS included in the Intel® Math Kernel Libraries (Intel® MKL) for acceleration. It can also use Intel® Threading Building Blocks (Intel® TBB) and Intel® Integrated Performance Primitives (Intel® IPP) for optimized performance on Intel platforms.

OpenCV uses the FFMPEG library and can use Intel® Quick Sync Video technology to accelerate encoding and decoding using hardware.

OpenCV and IoT

OpenCV has a wide range of applications in traditional computer vision applications such as optical character recognition or medical imaging.

For example, OpenCV can detect bone fractures [1]. OpenCV can also help classify skin lesions and assist in the early detection of skin melanomas [2].

However, OpenCV coupled with the right processor and camera can become a powerful new class of computer vision-enabled IoT sensor. This type of design can scale from simple sensors to multi-camera video analytics arrays. See Designing Scalable IoT Architectures for more information [3].

IoT developers can use OpenCV to build embedded computer vision sensors for detecting IoT application events such as motion detection or people detection.
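
As a sketch of such a sensor (an illustrative example under assumed values, not a reference design; the camera index and the 5,000-pixel threshold are arbitrary), a simple motion detector can be built from OpenCV's background subtraction, treating any frame whose foreground mask exceeds a pixel-count threshold as a motion event for the IoT application to publish:

#include <opencv2/opencv.hpp>
#include <iostream>

int main()
{
        cv::VideoCapture cap(0);                           // camera index is an assumption
        if (!cap.isOpened())
                return 1;

        // Adaptive background model; pixels that differ from it become foreground.
        cv::Ptr<cv::BackgroundSubtractor> bg = cv::createBackgroundSubtractorMOG2();

        cv::Mat frame, fgmask;
        while (cap.read(frame)) {
                bg->apply(frame, fgmask);

                // Remove speckle noise before counting foreground pixels.
                cv::erode(fgmask, fgmask, cv::Mat());
                cv::dilate(fgmask, fgmask, cv::Mat());

                const int moving = cv::countNonZero(fgmask);
                if (moving > 5000) {                       // threshold is an assumption
                        // A real sensor would publish an event here (MQTT, REST, etc.).
                        std::cout << "motion event: " << moving << " changed pixels" << std::endl;
                }
        }
        return 0;
}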

Designers can also use OpenCV to build even more advanced sensor systems such as face recognition, gesture recognition or even sentiment analysis as part of the IoT application flow.

IoT applications can also deploy OpenCV on Fog nodes at the Edge as an analytics platform for a larger number of camera based sensors.

For example, IoT applications use camera sensors with OpenCV for road traffic analysis, Advanced Driver Assistance Systems (ADAS) [3], video surveillance [4], and advanced digital signage with analytics in visual retail applications [5].

OpenCV Integration

Integrating OpenCV with a neural network backend unleashes the true power of computer vision. Using this approach, OpenCV works with convolutional neural networks (CNNs) and deep neural networks (DNNs), allowing developers to build innovative and powerful new vision applications.

To target multiple hardware platforms, these integrations need to be cross-platform by design, but hardware-specific optimization of deep learning algorithms can break that design goal. The OpenVX architecture standard addresses this by proposing resource and execution abstractions.

Hardware vendors can optimize implementations with a strong focus on specific platforms. This allows developers to write code that is portable across multiple vendors and platforms, as well as multiple hardware types.
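
To make the abstraction concrete, here is a hedged sketch using the standard OpenVX C API (an illustration only, not Intel's implementation; the image sizes and choice of filter nodes are arbitrary): the application describes its work as a graph of nodes, verifies the graph once, and then executes it, leaving the vendor's runtime free to schedule the nodes on whatever hardware it targets.

#include <VX/vx.h>
#include <stdio.h>

int main(void)
{
        vx_context context = vxCreateContext();
        vx_graph graph = vxCreateGraph(context);

        // Virtual images exist only inside the graph; the runtime decides
        // where and how to allocate them.
        vx_image input   = vxCreateImage(context, 640, 480, VX_DF_IMAGE_U8);
        vx_image blurred = vxCreateVirtualImage(graph, 640, 480, VX_DF_IMAGE_U8);
        vx_image output  = vxCreateImage(context, 640, 480, VX_DF_IMAGE_U8);

        // Build a two-node pipeline: Gaussian blur followed by a 3x3 median filter.
        vxGaussian3x3Node(graph, input, blurred);
        vxMedian3x3Node(graph, blurred, output);

        // Verification lets the implementation validate and optimize the graph
        // for its target hardware before any execution happens.
        if (vxVerifyGraph(graph) == VX_SUCCESS)
                vxProcessGraph(graph);
        else
                printf("graph verification failed\n");

        vxReleaseImage(&input);
        vxReleaseImage(&blurred);
        vxReleaseImage(&output);
        vxReleaseGraph(&graph);
        vxReleaseContext(&context);
        return 0;
}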

Intel® Computer Vision SDK (Beta) is an integrated design framework and a powerful toolkit for developers to solve complex problems in computer vision. It includes Intel’s implementation of the OpenVX API as well as custom extensions. It supports OpenCL custom kernels and can integrate CNN or DNN.

The pre-built and included OpenCV binary has hooks for Intel® VTune™ Amplifier for profiling vision applications.

Getting Started:

Try this tutorial on basic people recognition.  Also, see OpenCV 3.2.0 Documentation for more tutorials.

Related Software:

Intel® Computer Vision SDK - Accelerated computer vision solutions based on OpenVX standard, integrating OpenCV and deep learning support using the included Deep Learning (DL) Deployment Toolkit.

Intel® Integrated Performance Primitives (IPP) - Programming toolkit for high-quality, production-ready, low-level building blocks for image processing, signal processing, and data processing (data compression/decompression and cryptography) applications.

Intel® Math Kernel Library (MKL) - Library with accelerated math processing routines to increase application performance.

Intel® Media SDK - A cross-platform API for developing media applications using Intel® Quick Sync Video technology.

Intel® SDK for OpenCL™ Applications - Accelerated and optimized application performance with Intel® Graphics Technology compute offload and high-performance media pipelines.

Intel® Distribution for Python* - Specially optimized Python distribution for High-Performance Computing (HPC) with accelerated compute-intensive Python computational packages like NumPy, SciPy, and scikit-learn.

Intel® Quick Sync Video - Leverage dedicated media processing capabilities of Intel® Graphics Technology to decode and encode fast, enabling the processor to complete other tasks and improving system responsiveness.

Intel® Threading Building Blocks (TBB) - Library for shared-memory parallel programming and intra-node distributed memory programming.

References:

  1. Bone fracture detection using OpenCV
  2. Mole Investigator: Detecting Cancerous Skin Moles Through Computer Vision
  3. Designing Scalable IoT Architectures
  4. Advanced Driver Assistance Systems (ADAS)
  5. Smarter Security Camera: A Proof of Concept (PoC) Using the Intel® IoT Gateway
  6. Introduction to Developing and Optimizing Display Technology