
Performance Optimization of memcpy in DPDK


Introduction

Memory copy (memcpy) is a simple yet diverse operation: there are possibly hundreds of implementations that copy data from one part of memory to another, yet the discussion of how to evaluate and optimize memcpy never stops.

This article discusses how optimizations are positioned, conducted, and evaluated for use with memcpy in the Data Plane Development Kit (DPDK).

First, let’s look at the following simple memcpy function:

#include <stddef.h>
#include <stdint.h>

/* Copy n bytes from src to dst, one byte per iteration. */
void * simple_memcpy(void *dst, const void *src, size_t n)
{
        const uint8_t *_src = src;
        uint8_t *_dst = dst;
        size_t i;

        for (i = 0; i < n; ++i)
                _dst[i] = _src[i];

        return dst;
}

Is there anything wrong with this function? Not really. But it clearly misses some optimization opportunities. The function:

  • Does not employ single instruction, multiple data (SIMD)
  • Has no instruction-level parallelism
  • Lacks load/store address alignment

The performance of the above implementation depends entirely on the compiler’s optimization. Surprisingly, in some scenarios, this function outperforms the glibc memcpy. Of course, the compiler takes most of the credit by optimizing the implementation. But it also gets us thinking: Is there an ultimate memcpy implementation that outperforms all others?

This article holds the view that the ultimate memcpy implementation, providing the best performance in any given scenario (hardware + software + data) simply does not exist. Ironically, the best memcpy implementation is to completely avoid memcpy operations; the second-best implementation might be to handcraft dedicated code for each and every memcpy call, and there are others. Memcpy should not be considered and measured as one standalone part of the program; instead, the program should be seen as a whole—the data that one memcpy accesses has been and will be accessed by other parts of the program, also the instructions from memcpy and other parts of the program interact inside the CPU pipeline in an out-of-order manner. This is why DPDK introduced rte_memcpy, to accelerate the critical memcpy paths in core DPDK scenarios.

Common Optimization Methods for memcpy

There are abundant materials online for memcpy optimization; we provide only a brief summary of optimization methods here.

Generally speaking, memcpy spends CPU cycles on:

  1. Data load/store
  2. Additional calculation tasks (such as address alignment processing)
  3. Branch prediction

Common optimization directions for memcpy:

  1. Maximize memory/cache bandwidth (vector instructions, instruction-level parallelism)
  2. Load/store address alignment
  3. Batched sequential access
  4. Use non-temporal access instructions as appropriate
  5. Use string instructions as appropriate to speed up larger copies

Most importantly, all instructions are executed through the CPU pipeline; therefore, pipeline efficiency is everything, and the instruction flow needs to be optimized to avoid pipeline stalls.
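Below is a minimal sketch of a vectorized copy loop using SSE2 intrinsics, illustrating the first two directions (vector instructions plus simple tail handling). It is not DPDK's rte_memcpy: the function name is made up for illustration, and a production implementation would add wider vectors, destination alignment handling, and special cases for small or constant sizes.

#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Illustrative only: copy n bytes using 16-byte vector loads/stores,
 * then finish the remaining tail one byte at a time. */
static void * vector_memcpy(void *dst, const void *src, size_t n)
{
        uint8_t *d = dst;
        const uint8_t *s = src;

        while (n >= 16) {
                __m128i chunk = _mm_loadu_si128((const __m128i *)s);
                _mm_storeu_si128((__m128i *)d, chunk);
                s += 16;
                d += 16;
                n -= 16;
        }

        while (n--)
                *d++ = *s++;

        return dst;
}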

Optimizing Memcpy for DPDK

Since early 2015, DPDK's dedicated memcpy implementation, rte_memcpy, has been optimized several times to accelerate different DPDK use-case scenarios, such as vhost Rx/Tx. All the analysis and code changes can be viewed in the DPDK git log.

There are many ways to optimize an implementation. The simplest and most straightforward is trial and error: make a variety of improvements based on some baseline knowledge, verify them in the target scenario, and then choose the best one using a set of evaluation criteria. All you need is experience, patience, and a little imagination. Although this approach can sometimes bring surprises, it is neither efficient nor reassuring.

Another common approach sounds more promising: first, invest the initial effort to fully understand the behavior of the source code (assembly code, if necessary) and to establish the theoretical optimal performance. With this optimal baseline, the performance gap can be confirmed. Runtime sampling is then conducted to analyze defects in the existing code and seek improvement. This may require a lot of experience and analysis effort. For example, the vhost enqueue optimization in DPDK 16.11 is the result of several weeks' work spent sampling and analyzing. Finally, by moving three lines of code, tests performed with DPDK testpmd showed that enqueue efficiency improved by 1.7 times, as the enqueue cost dropped from about 250 cycles per packet to about 150 cycles per packet. Later, in DPDK 17.02, the rte_memcpy optimization patch was derived from the same idea. Such results are hard to achieve with the first method.

See Appendix A for a description of the test hardware configuration we used for testing. To learn more about DPDK performance testing with testpmd, read Testing DPDK Performance and Features with TestPMD on Intel® Developer Zone.

There are many useful tools for profiling and sampling such as perf and VTune™. They are very effective as long as you know what data you’re looking for.

Show Me the Data!

Ultimately, the goal of optimization is to speed up the performance of the target application scenario, which is a combination of hardware, software, and data. The evaluation methods vary.

For memcpy, a micro-benchmark can easily produce a few key performance numbers such as the copy rate (MB/s); however, those numbers have limited reference value. That's because memcpy is normally optimized at the programming-language level as well as at the instruction level for a specific hardware platform, specific software code, and even specific data lengths, and the memcpy algorithm itself doesn't leave much room for improvement. Different scenarios therefore require different optimization techniques, and micro-benchmarks speak only for themselves.

Also, it is not advisable to evaluate performance by timestamping the memcpy code. Modern CPUs have very complex pipelines that support prefetching and out-of-order execution, which results in significant deviations when performance is measured at the cycle level. Although forced synchronization can be achieved by adding serializing instructions, doing so may change the execution sequence of the instruction flow, degrade program performance, and defeat the original intention of the measurement. Meanwhile, instructions that are heavily optimized by the compiler can also appear out of order with respect to the source code, and forcing sequential compilation significantly impacts performance and makes the result meaningless. Besides, the execution time of an instruction stream includes not only the ideal execution cycles, but also the data access latency caused by pipeline stall cycles. Since the data accessed by a piece of code has probably been, and will be, accessed by other parts of the program, that code may appear to execute faster or slower simply by advancing or delaying data accesses. These complex factors make the seemingly easy task of memcpy performance evaluation troublesome.
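To make the pitfall concrete, the following sketch shows the kind of naive cycle-level measurement being cautioned against (the buffer size and the use of the glibc memcpy are arbitrary choices for illustration). Without serializing instructions, the CPU may reorder work around the timestamp reads; adding them perturbs the very pipeline behavior being measured.

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <x86intrin.h>  /* __rdtsc() */

int main(void)
{
        static char src[4096], dst[4096];
        uint64_t start, end;

        /* Naive measurement: rdtsc is not a serializing instruction,
         * so the copy may overlap with the surrounding timestamp reads. */
        start = __rdtsc();
        memcpy(dst, src, sizeof(src));
        end = __rdtsc();

        printf("memcpy took ~%llu reference cycles\n",
               (unsigned long long)(end - start));
        return 0;
}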

Therefore, field testing should be used for optimization evaluation. For example, in the application of Open vSwitch* (OvS) in the cloud, memcpy is heavily used in vhost Rx/Tx, and in this case we should take the final packet forwarding rate as the performance evaluation criteria for the memcpy optimization.

Test Results

Figure 1 below shows test results for Physical-VM-Physical (PVP) traffic, which is close to the actual application scenario. Comparative data was gathered by replacing rte_memcpy in DPDK vhost with the memcpy provided by glibc. The results show that a 22 percent increase in total bandwidth can be obtained simply by accelerating the vhost Rx/Tx path with rte_memcpy. Our test configuration is described below in Appendix A.

Figure 1. Performance comparison between DPDK rte_memcpy and glibc memcpy in OvS-DPDK

 

Continue the Conversation

Join the DPDK mailing list, dev@dpdk.org, where your feedback and questions about rte_memcpy are welcomed.

About the Author

Zhihong Wang is a software engineer at Intel. He has worked in various areas, including CPU performance benchmarking and analysis, packet processing performance optimization, and network virtualization.

Appendix A

Test Environment

  • PVP flow: Ixia* sends packets to the physical network card; OvS-DPDK forwards packets received on the physical network card to the virtual machine; the virtual machine processes the packets and sends them back through OvS-DPDK to the physical network card, and finally back to Ixia
  • The virtual machine performs MAC forwarding using DPDK testpmd
  • OvS-DPDK Version: Commit f56f0b73b67226a18f97be2198c0952dad534f1c
  • DPDK Version: 17.02
  • GCC/GLIBC Version: 6.2.1/2.23
  • Linux*: 4.7.5-200.fc24.x86_64
  • CPU: Intel® Xeon® processor E5-2699 v3 at 2.30GHz

OvS-DPDK Compile and Boot Commands

./ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
./ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem="1024,1024"
./ovs-vsctl add-br ovsbr0 -- set bridge ovsbr0 datapath_type=netdev
./ovs-vsctl add-port ovsbr0 vhost-user1 -- set Interface vhost-user1 type=dpdkvhostuser
./ovs-vsctl add-port ovsbr0 dpdk0 -- set Interface dpdk0 type=dpdk options:dpdk-devargs=0000:06:00.0
./ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10000
./ovs-ofctl del-flows ovsbr0
./ovs-ofctl add-flow ovsbr0 in_port=1,action=output:2
./ovs-ofctl add-flow ovsbr0 in_port=2,action=output:1

Use DPDK testpmd for Virtual Machine Forwarding

set fwd mac
start


Configure SR-IOV Network Virtual Functions in Linux* KVM*


Introduction

This tutorial demonstrates several different ways of using single root input/output virtualization (SR-IOV) network virtual functions (VFs) in Linux* KVM* virtual machines (VMs) and discusses the pros and cons of each method.

Here’s the short story: use the KVM virtual network pool of SR-IOV adapters method. It has the same performance as the VF PCI* passthrough method, but it’s much easier to set up. If you must use the macvtap method, use virtio as your device model because every other option will give you horrible performance. And finally, if you are using a 40 Gbps Intel® Ethernet Server Adapter XL710, consider using the Data Plane Development Kit (DPDK) in the guest; otherwise you won’t be able to take full advantage of the 40 Gbps connection.

There are a few downloads associated with this tutorial that you can get from github.com/intel (see the Resources section at the end of this article).

SR-IOV Basics

SR-IOV provides the ability to partition a single physical PCI resource into virtual PCI functions which can then be injected into a VM. In the case of network VFs, SR-IOV improves north-south network performance (that is, traffic with endpoints outside the host machine) by allowing traffic to bypass the host machine’s network stack. 

Supported Intel Network Interface Cards

A complete list of Intel Ethernet Server Adapters and Intel® Ethernet Controllers that support SR-IOV is available online, but in this tutorial, I evaluated just four: 

  • The Intel® Ethernet Server Adapter X710, which supports up to 32 VFs per port
  • The Intel Ethernet Server Adapter XL710, which supports up to 64 VFs per port 
  • The Intel® Ethernet Controller X540-AT2, which supports 32 VFs per port 
  • The Intel® Ethernet Controller 10 Gigabit 82599EB, which supports 32 VFs per port

Assumptions

There are several different ways to inject an SR-IOV network VF into a Linux KVM VM. This tutorial evaluates three of those ways:

  • As an SR-IOV VF PCI passthrough device
  • As an SR-IOV VF network adapter using macvtap 
  • As an SR-IOV VF network adapter using a KVM virtual network pool of adapters

Most of the steps in this tutorial can be done using either the command line virsh tool or using the virt-manager GUI. If you prefer to use the GUI, you’ll find screenshots to guide you; if you are partial to the command line, you’ll find code and XML snippets to help. Note that there are several steps in this tutorial that cannot be done via the GUI. 

Network Configuration

The test setup included two physical servers—net2s22c05 and net2s18c03—and one VM—sr-iov-vf-testvm—that was hosted on net2s22c05. Net2s22C05 had one each of the four Intel Ethernet Server Adapters listed above with one port in each adapter directly linked to a NIC port with equivalent link speed in net2s18c03. The NIC ports on each system were in the same subnet: those on net2s18c03 all had static IP addresses with .1 as the final dotted quad, the net2s22c05 ports had .2 as the final dotted quad, and the virtual ports in sr-iov-vf-testvm all had .3 as the final dotted quad:  


System Configuration

Host Configuration

  • CPU: 2-socket, 22-core Intel® Xeon® processor E5-2699 v4 @ 2.20 GHz
  • Memory: 128 GB
  • NICs:
      Intel® Ethernet Controller X540-AT2
      Intel® 82599 10 Gigabit TN Network Connection
      Intel® Ethernet Controller X710 for 10GbE SFP+
      Intel® Ethernet Controller XL710 for 40GbE QSFP+
  • Operating System: Ubuntu* 16.04 LTS
  • Kernel parameters: GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt"

Guest Configuration

The following XML snippets are taken from the output of # virsh dumpxml sr-iov-vf-testvm.

  • CPU: <vcpu placement='static'>8</vcpu><cpu mode='host-passthrough'><topology sockets='1' cores='8' threads='1'/></cpu>
  • Memory: <memory unit='KiB'>12582912</memory><currentMemory unit='KiB'>12582912</currentMemory>
  • NIC: <interface type='network'><mac address='52:54:00:4d:2a:82'/><source network='default'/><model type='rtl8139'/><address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/></interface>
    (The SR-IOV NIC XML tag varied based on the configurations discussed in this tutorial.)
  • Operating System: Ubuntu 14.04 LTS. Note: This OS and Linux* kernel version were chosen based on a specific usage; otherwise, newer versions would have been used.
  • Linux* Kernel Version: 3.13.0-24-lowlatency
  • Software: ufw purged, lshw installed

Note: Ubuntu 14.04 LTS did not come with the i40evf driver preinstalled. I built the driver from source and then loaded it into the kernel. I used version 2.0.22. Instructions for building and loading the driver are located in the README file.

The complete KVM definition file is available online.

Scope

This tutorial does not focus on performance. And even though the performance of the Intel Ethernet Server Adapter XL710 SR-IOV connection listed below clearly demonstrates the value of the DPDK, this tutorial does not focus on configuring SR-IOV VF network adapters to use DPDK in the guest VM environment. For more information on this topic, see the Single Root IO Virtualization and Open vSwitch Hands-On Lab Tutorials. You can find detailed instructions on how to set up SR-IOV VFs on the host in the SR-IOV Configuration Guide and the video Creating Virtual Functions using SR-IOV. But to get you started: once you have enabled iommu=pt and intel_iommu=on as kernel boot parameters, and provided you are running a Linux kernel of at least version 3.8.x, initialize SR-IOV VFs by issuing the following command:

     # echo 4 > /sys/class/net/<device name>/device/sriov_numvfs

Once an SR-IOV NIC VF is created on the host, the driver/OS assigns a MAC address and creates a network interface for the VF adapter.
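For example, the following commands (a sketch; substitute your own device name) confirm how many VFs exist, list the VF PCI devices, and show the MAC addresses the driver assigned to each VF:

     # cat /sys/class/net/<device name>/device/sriov_numvfs
     # lspci | grep -i "Virtual Function"
     # ip link show <device name>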

Parameters

When evaluating the advantages and disadvantages of each insertion method, I looked at the following:

  • Host device model
  • PCI device information as reported in the VM
  • Link speed as reported by the VM
  • Host driver
  • Guest driver 
  • Simple performance characteristics using iperf
  • Ease of setup

Host Device Model

This is the device type that is specified when the SR-IOV network adapter is inserted into the VM. In the virt-manager GUI, the following typical options are available:

  • Hypervisor default (which in our configuration defaulted to rtl8139)
  • rtl8139
  • e1000
  • virtio

Additional options were available on our test host machine, but they had to be entered into the VM XML definition using # virsh edit. I additionally evaluated the following:

  • ixgbe
  • i82559er

VM Link Speed

I evaluated link speed of the SR-IOV VF network adapter in the VM using the following command:

     # ethtool eth1 | grep Speed


Host Network Driver

This is the driver that the KVM Virtual Machine Manager (VMM) uses for the NIC as displayed in the <driver> XML tag when I ran the following command on the host after starting the VM: 

     # virsh dumpxml sr-iov-vf-testvm | grep -w hostdev -A9


Guest Network Driver

This is the driver that the VM uses for the NIC. I found the information by first determining the SR-IOV NIC PCI interface information in the VM:

     # lshw -c network -businfo


Using this PCI bus information, I then ran the following command to find what driver the VM had loaded into the kernel for the SR-IOV NIC:

     # lspci -vmmks 00:03.0


Performance 

Because this is not a performance-oriented paper, this data is provided only to give a rough idea of the performance of the different configurations. The command I ran on the server system was:

     # iperf -s -f m


And the client command was: 

     # iperf -c <server ip address> -f m -P 2


I only did one run with the test VM as the server and one run with the test VM as a client.

Ease of Setup

This is an admittedly subjective evaluation parameter. But I think you’ll agree that there was a clear loser: the option of inserting the SR-IOV VF as a PCI passthrough device.

SR-IOV Virtual Function PCI Passthrough Device 

The most basic way to connect an SR-IOV VF to a KVM VM is by directly importing the VF as a PCI device using the PCI bus information that the host OS assigned to it when it was created. 

Using the Command Line

Once the VF has been created, the network adapter driver automatically creates the infrastructure necessary to use it. 

Step 1: Find the VF PCI bus information. 

In order to find the PCI bus information for the VF, you need to know how to identify it, and sometimes the interface name that is assigned to the VF seems arbitrary. For example, in the following figure there are two VFs and the PCI bus information is outlined in red, but it is impossible to determine from this information which physical port the VFs are associated with.

     # lshw -c network -businfo

Find the VF PCI bus information.

The following bash script lists all the VFs associated with a physical function.

#!/bin/bash

NIC_DIR="/sys/class/net"
for i in $( ls $NIC_DIR) ;
do
	if [ -d "${NIC_DIR}/$i/device" -a ! -L "${NIC_DIR}/$i/device/physfn" ]; then
		declare -a VF_PCI_BDF
		declare -a VF_INTERFACE
		k=0
		for j in $( ls "${NIC_DIR}/$i/device" ) ;
		do
			if [[ "$j" == "virtfn"* ]]; then
				VF_PCI=$( readlink "${NIC_DIR}/$i/device/$j" | cut -d '/' -f2 )
				VF_PCI_BDF[$k]=$VF_PCI
				#get the interface name for the VF at this PCI Address
				for iface in $( ls $NIC_DIR );
				do
					link_dir=$( readlink ${NIC_DIR}/$iface )
					if [[ "$link_dir" == *"$VF_PCI"* ]]; then
						VF_INTERFACE[$k]=$iface
					fi
				done
				((k++))
			fi
		done
		NUM_VFs=${#VF_PCI_BDF[@]}
		if [[ $NUM_VFs -gt 0 ]]; then
			#get the PF Device Description
			PF_PCI=$( readlink "${NIC_DIR}/$i/device" | cut -d '/' -f4 )
			PF_VENDOR=$( lspci -vmmks $PF_PCI | grep ^Vendor | cut -f2)
			PF_NAME=$( lspci -vmmks $PF_PCI | grep ^Device | cut -f2 )
			echo "Virtual Functions on $PF_VENDOR $PF_NAME ($i):"
			echo -e "PCI BDF\t\tInterface"
			echo -e "=======\t\t========="
			for (( l = 0; l < $NUM_VFs; l++ )) ;
			do
				echo -e "${VF_PCI_BDF[$l]}\t${VF_INTERFACE[$l]}"
			done
			unset VF_PCI_BDF
			unset VF_INTERFACE
			echo ""
		fi
	fi
done

With the PCI bus information from this script, I imported a VF from the first port on my Intel Ethernet Controller X540-AT2 as a PCI passthrough device.

PCI passthrough device

Step 2: Add a hostdev tag to the VM.

Using the command line, use # virsh edit <VM name> to add a hostdev XML tag to the machine. Use the host machine PCI domain, bus, slot, and function information from the bash script above for the source tag’s address attributes.

# virsh edit <name of virtual machine>
# virsh dumpxml <name of virtual machine>
<domain>
…
<devices>
…
<hostdev mode='subsystem' type='pci' managed='yes'><source><address domain='0x0000' bus='0x03' slot='0x10' function='0x0'/></source></hostdev>
…
</devices>
…
</domain>

Once you exit the virsh edit command, KVM automatically adds an additional <address> tag to the hostdev tag to allocate the PCI bus address in the VM.
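For illustration only (the guest-side slot value below is hypothetical; KVM chooses it), the resulting hostdev element then looks roughly like this:

<hostdev mode='subsystem' type='pci' managed='yes'><source><address domain='0x0000' bus='0x03' slot='0x10' function='0x0'/></source><address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/></hostdev>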

Step 3: Start the VM.

     # virsh start <name of virtual machine>

Start the VM.

Using the GUI

Note: I have not found an elegant way to discover the SR-IOV PCI bus information using graphical tools. 

Step 1: Find the VF PCI bus information.

See the commands from Step 1 above. 

Step 2: Add a PCI host device to the VM.

Once you have the host PCI bus information for the VF, using the virt-manager GUI, click Add Hardware.

 Add a PCI host device to the VM.

After selecting PCI Host Device, you’ll see an array of PCI devices shown that can be imported into our VM.

 Add a PCI host device to the VM.

Give keyboard focus to the Host Device drop-down list, and then start typing the PCI Bus Device Function information from the bash script above, substituting a colon for the period (‘03:10:0’ in this case). After the desired VF comes into focus, click Finish.

 Add a PCI host device to the VM.

The PCI device just imported now shows up in the VM list of devices.

Add a PCI host device to the VM.

Step 3: Start the VM.

Add a PCI host device to the VM.

Summary

When using this method of directly inserting the PCI host device into the VM, there is no ability to change the host device model: for all NIC models, the host used the vfio driver. The Intel Ethernet Server Adapter XL710 and X710 adapters used the i40evf driver in the guest, and for both, the VM PCI device information reported the adapter name as “XL710/X710 Virtual Function.” The Intel Ethernet Controller X540-AT2 and Intel 82599 10 Gigabit Ethernet Controller adapters used the ixgbevf driver in the guest, and the VM PCI device information reported “X540 Ethernet Controller Virtual Function” and “82599 Ethernet Controller Virtual Function,” respectively. With the exception of the XL710, which showed a link speed of 40 Gbps, all of the 10 Gb adapters showed a link speed of 10 Gbps. For the X540, 82599, and X710 adapters, the iperf test ran at nearly line rate (~9.4 Gbps), and performance was roughly 8 percent worse when the VM was the iperf server than when the VM was the iperf client. While the XL710 performed better than the 10 Gb NICs, it ran at roughly 70 percent of line rate when the iperf server ran on the VM, and at roughly 40 percent of line rate when the iperf client was on the VM. This disparity is most likely due to the kernel being overwhelmed by the high rate of I/O interrupts, a problem that would be solved by using DPDK.

The one advantage to this method is that it allows control over which VF is inserted into the VM, whereas the virtual network pool of adapters method does not. This method of injecting an SR-IOV VF network adapter into a KVM VM is the most complex to set up and provides the fewest host device model options. Performance is not significantly different than the method that involves a KVM virtual network pool of adapters. However, that method is much simpler to use. Unless you need control over which VF is inserted into your VM, I don’t recommend using this method.

SR-IOV Network Adapter Macvtap 

The next way to add an SR-IOV network adapter to a KVM VM is as a VF network adapter connected to a macvtap on the host. Unlike the previous method, this method does not require you to know the PCI bus information for the VF, but you do need to know the name of the interface that the OS created for the VF when it was created.

Using the Command Line

Much of this method of connecting an SR-IOV VF to a VM can be done via the virt-manager GUI, but step 1 must be done using the command line.

Step 1: Determine the VF interface name 

As shown in the following figure, after creating the VF, use the bash script listed above to display the network interface names and PCI bus information assigned to the VFs.

Determine the VF interface name

With this information, insert the VFs into your KVM VM using either the virt-manager GUI or the virsh command line.

Step 2:  Add an interface tag to the VM.

To use the command-line with the macvtap adapter solution, with the VM shut off, edit the VM configuration file and add an ‘interface’ tag with sub-elements and attributes shown below. The interface ‘type’ is ‘direct’, and the ‘dev’ attribute of the ‘source’ sub-element must point to the interface name that the host OS assigned to the target VF. Be sure to specify the ‘mode’ attribute of the ‘source’ element as ‘passthrough’:

# virsh edit <name of virtual machine>
# virsh dumpxml <name of virtual machine>
<domain>
…
<devices>
…
   <interface type='direct'><source dev='enp3s16f1' mode='passthrough'/></interface>
…
</devices>
…
</domain>

Once the editor is closed, KVM automatically assigns a MAC address to the SR-IOV interface, uses the default model type value of rtl8139, and assigns the NIC a slot on the VM’s PCI bus. 
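For illustration only (the MAC address and guest PCI slot below are hypothetical; KVM assigns them), the completed interface element as shown by virsh dumpxml then looks roughly like this:

<interface type='direct'><mac address='52:54:00:xx:xx:xx'/><source dev='enp3s16f1' mode='passthrough'/><model type='rtl8139'/><address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/></interface>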

Step 3: Start the VM.

 Start the VM.

As the VM starts, KVM creates a macvtap adapter (macvtap0 in this case) on the specified VF. On the host, you can see that the macvtap adapter that KVM created for your VF NIC uses a MAC address that is different from the MAC address on the other end of the macvtap in the VM:

     # ip l | grep enp3s16f1 -A1

 Start the VM.

The fact that there are two MAC addresses assigned to the same VF—one by the host OS and one by the VM—suggests that the network stack using this configuration is more complex and likely slower.

Using the GUI

With the exception of determining the interface name of the desired VF, all the steps of this method can be done using the virt-manager GUI.

Step 1: Determine the VF interface name.

See the command line Step 1 above.

Step 2: Add the SR-IOV macvtap adapter to the VM.

Using virt-manager, add hardware to the VM.

 Add the SR-IOV macvtap adapter to the VM.

Select Network as the type of device.

 Add the SR-IOV macvtap adapter to the VM.

For the Network source, choose the Host device <interface name>: macvtap line from the drop-down control, substituting for “interface name” the interface that the OS assigned to the VF created earlier.  

 Add the SR-IOV macvtap adapter to the VM.

Note virt-manager’s warning about communication with the host using macvtap VFs.

 Add the SR-IOV macvtap adapter to the VM.

Ignore this warning and choose Passthrough in the Source mode drop-down control.

 Add the SR-IOV macvtap adapter to the VM.

Note that virt-manager assigns a MAC address to the macvtap VF that is NOT the same address as the host OS assigned to the SR-IOV VF.

 Add the SR-IOV macvtap adapter to the VM.

Finally, click Finish.

Step 3: Start the VM.

 Add the SR-IOV macvtap adapter to the VM.

Summary

When using the macvtap method of connecting an SR-IOV VF to a VM, the host device model had a dramatic effect on all parameters, and there was no host driver information listed regardless of configuration. Unlike the other two methods, it was impossible to tell from the VM PCI device information which model of VF was underneath. Like the direct VF PCI passthrough insertion option, this method allows you to control which VF you wish to use. Regardless of which VF was connected, when the host device model was rtl8139 (the hypervisor default in this case), the guest driver was 8139cp, the link speed was 100 Mbps, and performance was roughly 850 Mbps. When e1000 was selected as the host device model, the guest driver was e1000, the link speed was 1 Gbps, and iperf ran at 2.1 Gbps with the client on the VM and 3.9 Gbps with the client on the other server. When the VM XML file was edited so that ixgbe was the host device model, the VM failed to boot. When the host device model tag in the VM XML was set to i82559er, the guest VM used the e100 driver for the VF, link speed was 100 Mbps, and iperf ran at 800 Mbps when the server was on the VM and 10 Mbps when the client was on the VM. Selecting virtio as the host device model clearly provided the best performance: no link speed was listed in that configuration, the VM used the virtio-pci driver, and iperf performance was roughly line rate for the 10 Gbps adapters. When the Intel Ethernet Server Adapter XL710 VF was inserted into the VM using macvtap, performance with the client on the VM was ~40 percent of line rate, similar to the other insertion methods; however, performance with the server on the VM was significantly worse than with the other insertion methods: ~40 percent of line rate versus ~70 percent.

The method of inserting an SR-IOV VF network device into a KVM VM via a macvtap is simpler to set up than the option of directly importing the VF as a PCI device. However, the connection performance varies by a factor of 100 depending on which host device model is selected. In fact, the default device model for both the command line and the GUI is rtl8139, which performs 10x slower than virtio, the best option. And if the i82559er host device model is specified in the KVM XML file, performance is 100x worse than with virtio. If virtio is selected, the performance is similar to that of the other methods of inserting the SR-IOV VF NIC mentioned here. If you must use this method of connecting the VF to a VM, be sure to use virtio as the host device model.

SR-IOV Virtual Network Adapter Pool 

The final method of using an SR-IOV VF NIC with KVM involves creating a virtual network based on the NIC PCI physical function. You don’t need to know PCI information as was the case with the first method, or VF interface names as was the case with the second method. All you need is the interface name of the physical function. Using this method, KVM creates a pool of network devices that can be inserted into VMs, and the size of that pool is determined by how many VFs were created on the physical function when they were initialized.

Using the Command Line

Step 1: Create the SR-IOV virtual network pool. 

Once the SR-IOV VFs have been created, use them to create a virtual network pool of network adapters. List physical network adapters that have VFs defined. You can identify them with the lines that begin ‘vf’:

     # ip l

 Create the SR-IOV virtual network pool.

Make an XML file (sr-iov-net-XL710.xml in the code snippet below) that contains an XML element using the following template, and then substitute for ‘ens802f0’ the interface name of the physical function used to create your VFs and a name of your choosing for ‘sr-iov-net-40G-XL710’:

# cat > sr-iov-net-XL710.xml << EOF
<network>
  <name>sr-iov-net-40G-XL710</name>
  <forward mode='hostdev' managed='yes'>
    <pf dev='ens802f0'/>
  </forward>
</network>
EOF

Once this XML file has been created, use it with virsh to create a virtual network:

     # virsh net-define sr-iov-net-XL710.xml

Step 2: Display all virtual networks.

To make sure the network was created, use the following command:

# virsh net-list --all
 Name                 State      Autostart     Persistent
----------------------------------------------------------
 default              active     yes           yes
 sr-iov-net-40G-XL710 inactive   no            yes

Step 3: Start the virtual network.

The following command instructs KVM to start the network just created. Note that the name of the network (sr-iov-net-40G-XL710) comes from the name XML tag in the snippet above.

     # virsh net-start sr-iov-net-40G-XL710

Step 4: Autostart the virtual network.

If you want to have the network automatically start when the host machine boots, make sure that the VFs get created at boot, and then:

     # virsh net-autostart sr-iov-net-40G-XL710

Step 5: Insert a NIC from the VF pool into the VM.

Once this SR-IOV VF network has been defined and started, insert an adapter on that network into the VM while it is stopped. Use virsh edit to add a network adapter XML tag to the machine that has the name of the virtual network as its source network, remembering to substitute the name of your SR-IOV virtual network for the ‘sr-iov-net-40G-XL710’ label.

# virsh edit <name of virtual machine>
# virsh dumpxml <name of virtual machine>
<domain>
…
<devices>
…
<interface type='network'><source network='sr-iov-net-40G-XL710'/></interface>
…
</devices>
…
</domain>

Step 6: Start the VM.

     # virsh start <name of virtual machine>

 Start the VM.

Using the GUI

Step 1: Create the SR-IOV virtual network pool.

I haven’t been able to find a good way to create an SR-IOV virtual network pool using the virt-manager GUI because the only forward mode options in the GUI are “NAT” and “Routed.” The required forward mode of “hostdev” is not an option in the GUI. See Step 1 above.

Step 2: Display all virtual networks.

Using the virt-manager GUI, edit the VM connection details to view the virtual networks on the host.

 Display all virtual networks.

The virtual network created in step 1 appears in the list.

 Display all virtual networks.

Step 3: Start the virtual network.

To start the network, select it on the left, and then click the green “play” icon.

 Start the virtual network.

Step 4: Autostart the virtual network.

To autostart the network when the host machine boots, select the Autostart box so that the text changes from Never to On Boot. (Note: this will fail if you don’t also automatically allocate the SR-IOV VFs at boot.)

 Autostart the virtual network.

Step 5: Insert a NIC from the VF pool into the VM.

Open the VM.

 Insert a NIC from the VF pool into the VM.

Click the information button (“I”) icon, and then click Add Hardware.

 Insert a NIC from the VF pool into the VM.

On the left side, click Network to add a network adapter to the VM.

 Insert a NIC from the VF pool into the VM.

Then select Virtual network ‘<your virtual network name>’: Hostdev network as the Network source, allow virt-manager to select a MAC address, and leave the Device model as Hypervisor default.

 Insert a NIC from the VF pool into the VM.

Click Finish. The new NIC appears in the list of VM hardware with “rtl8139” as the device model.

 Insert a NIC from the VF pool into the VM.

Step 6: Start the VM.

 Start the VM.

Summary

When using the network pool of SR-IOV VFs, selecting different host device models when inserting the NIC into the VM made no difference as far as iperf performance, guest driver, VM link speed, or host driver were concerned. In all cases, the host used the vfio driver. The Intel Ethernet Server Adapter XL710 and X710 used the i40evf driver in the guest, and for both, the VM PCI device information reported the adapter name as “XL710/X710 Virtual Function.” The Intel Ethernet Controller X540-AT2 and the Intel Ethernet Controller 10 Gigabit 82599EB used the ixgbevf driver in the guest, and the VM PCI device information reported “X540 Ethernet Controller Virtual Function” and “82599 Ethernet Controller Virtual Function,” respectively. With the exception of the Intel Ethernet Server Adapter XL710, which showed a link speed of 40 Gbps, all of the 10 Gb adapters showed a link speed of 10 Gbps. For the Intel Ethernet Controller X540, Intel Ethernet Controller 10 Gigabit 82599, and Intel Ethernet Server Adapter X710, the iperf test ran at nearly line rate (~9.4 Gbps), and performance was roughly 8 percent worse when the VM was the iperf server than when the VM was the iperf client. While the Intel Ethernet Server Adapter XL710 performed better than the 10 Gb NICs, it ran at roughly 70 percent of line rate when the iperf server ran on the VM, and at roughly 40 percent of line rate when the iperf client was on the VM. This disparity is most likely due to the kernel being overwhelmed by the high rate of I/O interrupts, a problem that would be solved by using the DPDK.

In my opinion, this method of using the SR-IOV NIC is the easiest to set up, because the only information needed is the interface name of the NIC physical function—no PCI information and no VF interface names. And with default settings, the performance was equivalent to the VF PCI passthrough option. The primary disadvantage of this method is that you cannot select which VF you wish to insert into the VM because KVM manages it automatically, whereas with the other two insertion options you can select which VF to use. So unless this ability to select which VF to use is a requirement for you, this is clearly the best method.

Additional Findings

In every configuration, the test VM was able to communicate with both the host and the external traffic generator, and the VM was able to continue communicating with the external traffic generator even when the host PF had no IP address assigned, as long as the PF link state on the host remained up. Additionally, I found that when all four VFs were inserted into the VM simultaneously using the virtual network adapter pool method and iperf ran simultaneously on all four network connections, each connection still maintained the same performance as if run separately.

Conclusion

Using SR-IOV network adapter VFs in a VM can accelerate north-south network performance (that is, traffic with endpoints outside the host machine) by allowing traffic to bypass the host machine’s network stack. There are several ways to insert an SR-IOV NIC into a KVM VM using the command line and virt-manager, but using a virtual network pool of SR-IOV VFs is the simplest to set up and provides performance that is as good as the other methods. If you need to be able to select which VF to insert into the VM, the VF PCI passthrough option will likely be best for you. And if you must use the macvtap method, be sure to select ‘virtio’ as your host device type; otherwise your performance will be very poor. Additionally, if you are using an Intel Ethernet Server Adapter XL710, consider using DPDK in the VM in order to take full advantage of the SR-IOV adapter’s speed.

About the Author

Clayne Robison is a Platform Application Engineer at Intel, working with software companies in Software Defined Networks and Network Function Virtualization. He lives in Mesa, Arizona, USA with his wife and the six of his eight children still at home. He’s a foodie that enjoys travelling around the world, always with his wife, and sometimes with the kids. When he has time to himself (which isn’t very often), he enjoys gardening and tending the citrus trees in his yard.

Resources

SR-IOV Configuration Guide: http://www.intel.com/content/dam/www/public/us/en/documents/technology-briefs/xl710-sr-iov-config-guide-gbe-linux-brief.pdf

Creating Virtual Functions Using SR-IOV: http://software.intel.com/en-us/videos/creating-virtual-functions-using-sr-iov

FAQ for Intel® Ethernet Server Adapters with SR-IOV: http://www.intel.com/content/www/us/en/support/network-and-i-o/ethernet-products/000005722.html 

SR-IOV for NFV Solutions: http://www.intel.com/content/dam/www/public/us/en/documents/technology-briefs/sr-iov-nfv-tech-brief.pdf 

SDN-NFV-Hands-on-Samples: https://github.com/intel/SDN-NFV-Hands-on-Samples 

Trion Worlds: Moving With the Times


The original article is published by Intel Game Dev on VentureBeat*: Trion Worlds: Moving with the times. Get more game dev news and related topics from Intel on VentureBeat.

 Colorful, futuristic, 3d gaming plaza, populated with other colorful gaming characters.

In the big picture of global industry, ten years may not sound like a long time, but in the world of video games—especially online games—it’s a lifetime. Certainly, the ten years since Trion Worlds* was founded have seen significant shifts in the tastes of gamers, resulting in the company evolving its own game designs and styles to match emerging trends.

This evolution is reflected to a degree in the journey taken by CEO Scott Hartsman at the Redwood City, CA-based company. After joining initially as Executive Producer of Rift and overseeing the successful launch of the massively multiplayer online (MMO) game, he returned four years ago, as CEO. But Hartsman’s experience in the online PC game space goes back about as far as it’s possible to go in the industry, with stints designing and running games at early online companies such as ENGAGE games online*, Simutronics*, and then at Sony Online Entertainment* with its genre-changing EverQuest and EverQuest II.

It’s this deep experience throughout the fascinating evolution of online gaming that perfectly positions Hartsman to navigate rapidly changing gamer tastes. Casting back to 2007, Trion Worlds was founded with the goal of creating a technology platform on which massive online worlds could be built. At the same time, the company was also developing a game—Rift—to showcase the technology and provide its own shared world experience.

“Things people were aspiring to create were dramatically different from what people are trying to build today,” says Hartsman. “That’s due to changes in technology, people’s tastes, and business models. I guess you could call it Trion 2.0 at this point.”

The shift has seen Trion focus on games or, more specifically, on the tastes and interests of gamers. “We weren’t an engine shop for developers, and because our intended customers were gamers we had to start acting like a game company. So we started focusing more on the games we were creating and less on their underlying core tech,” says Hartsman.

Creating a rift

Rift went on to be a big success in the MMO game space, which reinforced the lean toward game development over technology development. But even then, trends in online gaming showed rapid signs of shifting in a direction different from the formulae that had underpinned the design and creation of Rift.

“Going back to 2010 and 2011, gamers were exploring massive worlds for an average of four hours a day, which was very similar to the EverQuest and World of Warcraft* era,” says Hartsman.

 An animal that looks like an armored triceratops, rhinoceros mix is ridden by a warrior through a semi wild landscape.

Above: Rift required several years of development, but went on to be a successful massively multiplayer online game.

Hartsman identifies core adjustments not just in gamer preferences, but also in fundamental accessibility to the Internet and interest in games. “The population of the Internet essentially has evolved to be everyone,” he says. “But tastes change and people are looking for different experiences.”

Part of the motivation of spending those daily hours in online worlds was the social connection made between gamers. Since interest in online gaming wasn’t universal, a willingness to play every day, learn deep game systems, and share experiences was limited, to a degree, to those that ‘got it.’

“Now, people bring their friends with them when they play games. Every game is an online game, and it’s not about coming in to make new friends,” he says. The shift in social dynamic has directly impacted the kind of game experiences embraced by a much wider audience.

“Gamers now play games in five- to 20-minute chunks that they can play over and over again. Look at Hearthstone, League of Legends, Dota, and Warframe, games that have the same depth of engagement but without massive synchronous worlds,” says Hartsman.

Reacting to the world

The change in tastes has led to Trion Worlds bringing its own competitive game to market in the form of Atlas Reactor. Emerging out of its own internal Trion Labs—where 35 initial one-page pitches were whittled down to 15 treatments and then down to three prototypes—a passionate team crafted its unique take on the crowded competitive gaming space.

 Colorful, futuristic, 3d characters engaged in battle

Above: “Constant iteration of builds is vital to producing the best possible game,” says Trion CEO Scott Hartsman.

“It’s more strategic than fast-action, and it’s set in an arena, but we didn’t want to be entry number eight in the lane-based MOBA (multiplayer online-battle arena) market. If you’re not in the top three, you might as well not exist,” says Hartsman on the importance of differentiating. For Atlas Reactor, it was important to find a niche to own and so, rather than challenge the fast-action lane games, the team crafted a simultaneous turn-based formula for its four-on-four competition. “It’s entertaining in the way Texas Hold ‘Em or American Football are entertaining,” says Hartsman. “You get a plan in your head and then everything happens at once as the plan plays out.”

For Trion Worlds, it’s refreshing to develop a game in a tighter, more focused environment given past experience building massive synchronous online worlds. “When you make an open world game, you can’t get a sense on whether it’s fun or cohesive until very late in the project,” says Hartsman. “You never get the vertical slice to see if it’s fun because it takes four years to get to that stage.”

“It’s bringing game development a little out of the wild west—like it was a decade ago—to something with more predictability and sanity,” says Hartsman, adding that the opportunity to constrain the scope of the project has the added benefit of “increasing the chance of actually shipping the product.”

Another key off-shoot of developing along a narrower vertical slice of gameplay is the ability to iterate constantly. Hartsman is keenly aware that despite tremendous hype, an open world game could launch and simply not function. “I have a difficult time being hard on developers in those situations,” he explains. “Because they are so complex, in many ways it’s a miracle that they work at all.”

Iterate, iterate, iterate

“The number of iterations you do on any given thing that you ship is directly proportional to the final quality,” asserts Hartsman. “It’s not all about time, resources, money, and head count … it’s all about the number of iterations, and the number of meaningful ones that you can do, and as soon as you can do them, the better off you’re going to be,” he adds.

 Colorful, futuristic, 3d characters engaged in battle

Above: Owning a niche was vital to Trion Worlds as the company moved into the competitive gaming space.

Despite a somewhat easier development process with iterative gameplay testing, the marketplace challenge is heightened due to the volume of competition and evolving audience expectations. “Quality goes up and player expectation with it,” says Hartsman of the current competitive environment, “but that’s what we want to focus on, improving quality rather than making wider and wider landscapes.”

As the development pendulum hovers around the shorter experiences that have fueled the massive growth in MOBAs and mobile gaming, Hartsman relishes the opportunities still out there should it swing back to open, shared experiences. “I look forward to that again, it’s a personal passion. As a gamer, I enjoy playing those games, but the fact that we’re making these multi-strained games helps us streamline the quality,” he says.

Whatever direction the gaming industry wind is blowing, the lesson—or the message—is to stay nimble, and stay attuned to the changes in what the gamer wants to play. “People sticking around the last decade are in a constant state of reinvention,” says Hartsman. The only real certainty is that by the time your game is finished, another trend will have shifted the direction of consumer interest. Stay nimble.

Blowfish Bidding for the Big Time


The original article is published by Intel Game Dev on VentureBeat*: Blowfish bidding for the big time. Get more game dev news and related topics from Intel on VentureBeat.

 Mech-Man, shooting away at opponents with guns blazing

The challenges facing fledgling studios stepping out in the competitive wide world of game development can be daunting when they’re known, terrifying when they are unseen, and plain baffling when the rules are changed on the fly.

For Blowfish Studios, founded by Ben Lee and Aaron Grove in 2010 in Sydney, Australia, witnessing the opening and closing of several studios led to defining their ambition as establishing a “sustainable” indie game studio. As many developers can attest, that is easier said than successfully accomplished.

“You can make quite a lot of money making mobile games and free-to-play, but that’s not really what we wanted to do,” Lee says. “We play all kinds of games, and while we’re 100 percent committed to keeping a sustainable balance between work-for-hire and making our own games, we’re really here to do the latter.”

The studio’s rookie project, Siegecraft*, laid the groundwork as a featured game release with the iPhone* 4S and 5, and that awareness ultimately drove it to become a top gaming app on the iPad platform. Of course, for any studio starting out, following the core passion is vital to maintaining commitment to and enthusiasm for games that require so much mental investment, so shifting to PC and beginning work on Gunscape was an important step.

Overseeing their own destiny, the Blowfish team could now focus their talents on building the game that would help boost the studio’s profile and that would make a statement in the competitive PC shooter market.

Landscape Gunscape

The team’s commitment to creating their chart-topping iOS title paved the way for the development of Gunscape, a shooter game that blends classic FPS settings and enemies with more recent (but retro-styled) block-based building mechanics. On the face of it, it makes perfect sense: bring Doom to the Minecraft crowd; let the audience be the creators; let everyone share anything they create.

Screenshot of split screen FPS shooter action in a game world

Above: Split-screen action helps define Gunscape’s social gameplay.

Blowfish crafted the tools and the means to share user-generated content, as well as its own levels that acted as examples of how the game could play out along its blocky, stylized path. The intention (and hope) was that users would embrace the user-friendly tools to craft numerous shooter experiences. Oftentimes in these situations, it works out that 1 percent of the total audience creates the content that the remaining 99 percent consume. Not so with Gunscape as Lee suggests that “most of them” have been involved in map creation to some extent.

Collaborative projects have also helped the community overcome weaknesses in the supplied map design mechanics. Lee freely admits that the omission of timers attached to enemies or map events “was probably an oversight on our part.” But the community solved the shortcoming themselves. You can set monster spawns and traps such as darts, and although it was never specified in any documentation, darts deal a set amount of damage. Map creators figured out this amount so that they could trigger a trap that spawns a monster and then fires darts at it to kill it over a set number of hits, which in turn can trigger another event. And so a MacGyver’d timer system was born.

Screenshot of split screen FPS shooter action in a game world

Above: If some level designs look “inspired by” classic FPS games like Wolfenstein 3D, DOOM, and Quake, that’s not by accident.

“It’s a big inspiration for us to keep working on it,” says Lee of the community’s commitment to working with the tools provided. That’s despite Lee admitting the game has not performed financially as well as they had hoped, and hasn’t realized some of the creative goals the team had in mind in the early development phase. “Our initial idea was much grander,” Lee says, “so that it would become a RPG, so people could create those games as well.”

This has fed into the desire to maintain support for the community and the project as it was initially envisaged. “We still get people in, and they say they love it, so we want to keep investing in it,” says Lee. And it’s not like any of these efforts will be wasted, as the underlying technologies are being crafted so that they can be used in other projects: “so it’s not like something we do now won’t ever be used again,” Lee adds.

Building for the future

Success can clearly be measured in multiple ways. Keeping the lights on—“maintaining sustainability”—can be a core goal, but it also presents opportunity. For Blowfish that emerges from the interns who have gravitated from entry positions to full-time roles. With over half the staff freshly out of game colleges, it shouldn’t be a stretch to consider their position a little precarious.

But that’s the point for Blowfish. Not only has this renegade group of executives been backed by straight-out-of-college enthusiasts, but the results have been… sustainable.

While the trajectory is clear to the founders, gamers still need to peel the shell that reveals the map that tells a story that matters. And it doesn’t matter whether it comes from an intern, a noob, or an executive; the only criterion that counts among iOS and general PC releases is that the best game wins.

Let the games begin.

Refocus. Rebuild. Remake Something New


The original article is published by Intel Game Dev on VentureBeat*: Refocus. Rebuild. Remake something new. Get more game dev news and related topics from Intel on VentureBeat.

Image - promo image, two lead characters in the foreground at the side of image, bots and spaceships battling behind them, center and background

You might think it’s pretty easy to start a new game development shop and craft a hit when your résumé reads as a who’s-who of game development royalty. That’s not the case, though, as even Jason Coleman, founder and president of Sparkypants, can attest.

His credits include working with legends such as Sid Meier at Microprose on Civilization* II, and also on Rise of Nations*, among others. After a spell contracting, Coleman revealed that many former colleagues from those days of high-level real-time strategy (RTS) game development wanted to reunite, given all the opportunities still available for crafting unique gaming experiences.

Their first game, Dropzone*, from the new studio, Sparkypants, pulls on some familiar elements of current MOBA games, but adds pieces that have defined RTS games of the past. What has changed for Coleman’s team is how the ability to rapidly iterate the game—almost daily, in fact—has led to consistent improvements.

“We can jam stuff in, rip stuff out, and it’s really driven our development process,” says Coleman. “It makes us think about games as gamers and developers, and also as spectators,” he adds.

This process also ensures that the original game design doc is essentially thrown out of the window. Initially codenamed “Sportsball”—with an aim to make it as competitive and accessible as sports—the core premise was to deconstruct the traditional RTS for the modern day.

“Vision for what we’re trying to do from the first day was simple: Can we reimagine an RTS for the modern day, and reconstruct it,” says Coleman.

It meant asking some fundamental questions: What are the biggest challenges we have as gamers? “Time,” states Coleman, “so we’re playing mobile games because that’s where we can find the time, and that’s where the 15-minute time limit came from.”

Limiting matches to this digestible format required a set of new considerations for the veteran crew. Chief among them was to still guarantee that the experience was really satisfying.

Screenshot of colorful player and creature engaged in battle with various lighting effects on a 3d arena.

Above: Action gets intense in the strategy game, Dropzone*.

“But to do that you have to strip out familiar features of an RTS, such as base-building,” says Coleman. That resulted in focusing on other key elements that will scratch the strategic itch for traditional RTS players, as well as provide a core entertaining backbone that would entice new players into the fold.

“All the changes came out of experimentation. We had a saying that if it takes longer to talk about than to implement it we’ll just go implement it,” Coleman reveals.

One significant change to the original design vision was prompted by the community. Eventually the team realized that, after all their discussions about its merits, they could just build it and find out whether it turned out to be fun. That’s how the ability to play three-versus-three came about, building on the first vision that the game should just be one-on-one.

“That absolutely came out of the community. We always said it was a stupid idea, and our hardcore one-vs-one players also thought it was a bad idea. But this was a perfect example of us talking about it so much and just dismissed it, and we really should have just made it at the start… It took us a morning to do a hacky version of it and it was immediately fun. But there is a skill curve to Dropzone that is steep at the beginning…so the three-v-three with one hero each lowers that drastically,” says Coleman of this significant change to their plan.

This addition also brought new players into the Dropzone fold as Coleman accepts that managing three heroes, understanding map control points, and remaining aware of your opponent can be “really intense from the beginning, and possibly also exhausting!”

Given the heavy competition in the MOBA genre, it was vital that Dropzone embrace its unique mechanics to become a standout product in the space. For the Sparkypants team that involved maintaining the intensity that’s so much a part of the RTS experience.

“The main resource to manage is the player’s attention span,” says Coleman of a core design philosophy for the genre. “The goal is to tune the game so that there’s always something to do, and always a little more than you can manage.

“Having three heroes to control made a huge difference, along with giving them strategic reasons to split up. And so map control in Dropzone is important, as is ensuring there is the right balance of elements to do around the map,” explains Coleman.

Though it wasn’t the initial intention to establish Dropzone as a prominent eSport feature, additions like the Spectator Mode and Ranked play for one-versus-one play should help it carve a place in this burgeoning market. Coleman accepts that competitive eSports play is often driven by a relatively small number of hardcore players, but that these players—through gameplay videos and streaming—drive a lot of aspirational aspects.

For Dropzone, the once-disputed and now popular three-versus-three mode also afforded newcomers the ability to learn the ropes they could then take into one-versus-one play as they watch other players online and figure out their techniques.

Coleman is also proud of the team’s studio-built game engine. Starting a new company and designing a new game can be a significant challenge in itself without the added pressure of dealing with core technical hurdles. In particular, the rapid load times let players change their heroes and skill loadouts, then quickly enter the sandbox mode and test new tactics. Understanding how the skills can work collectively is part of that skill component that has hooked the fan base throughout the game’s life in Early Access, and particularly now that it has gone free-to-play.

Though its roots may lie in the details of classic strategy gaming experiences, it maintains a fun spirit in its heroes, which include a dog and a brain!

“And we’ve been asked when we are doing a dolphin hero,” adds Coleman. A great deal of detail has been packed into a digestible 15-minute game format, and with the daily iterations and suggestions from the community driving efforts to improve the experience, expect Dropzone to be a serious factor for strategy gamers new and old.

Using Intel® MPI Library on Intel® Xeon Phi™ Product Family


Introduction

The Message Passing Interface (MPI) standard defines a message-passing library: a collection of routines used in distributed-memory parallel programming. This document is designed to help users get started writing code and running MPI applications using the Intel® MPI Library on a development platform that includes the Intel® Xeon Phi™ processor or coprocessor. The Intel MPI Library is a multi-fabric message passing library that implements the MPI-3.1 specification (see Table 1).

In this document, the Intel MPI Library 2017 and 2018 Beta for Linux* OS are used.

Table 1. Intel® MPI Library at a glance

Processors: Intel® processors, coprocessors, and compatibles

Languages: Natively supports C, C++, and Fortran development

Development Environments: Microsoft Visual Studio* (Windows*), Eclipse*/CDT* (Linux*)

Operating Systems: Linux and Windows

Interconnect Fabric Support:
  • Shared memory
  • RDMA-capable network fabrics through DAPL* (for example, InfiniBand*, Myrinet*)
  • Intel® Omni-Path Architecture
  • Sockets (for example, TCP/IP over Ethernet, Gigabit Ethernet*) and others

This document summarizes the steps to build and run an MPI application on an Intel® Xeon Phi™ processor x200, on an Intel® Xeon Phi™ coprocessor x200, and on an Intel® Xeon Phi™ coprocessor x100, natively or symmetrically. First, we introduce the Intel Xeon Phi processor x200 product family and Intel Xeon Phi processor x100 product family and the MPI programming models.

Intel® Xeon Phi™ Processor Architecture

Intel Xeon Phi processor x200 product family architecture: There are two versions of this product. The processor version is the host processor and the coprocessor version requires an Intel® Xeon® processor host. Both versions share the architecture below (see Figure 1):

  • Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
  • Up to 72 cores with 2D mesh architecture
  • Each core has two 512-bit vector processing units (VPUs) and four hardware threads
  • Each pair of cores (tile) shares 1 MB L2 cache
  • 8 or 16 GB high-bandwidth on package memory (MCDRAM)
  • 6 channels DDR4, up to 384 GB (available in the processor version only)
  • For the coprocessor, the third-generation PCIe* is connected to the host


Figure 1. Intel® Xeon Phi™ processor x200 architecture.

To enable the functionalities of the Intel Xeon Phi processor x200, you need to download and install the Intel Xeon Phi processor software available here.

The Intel Xeon Phi coprocessor x200 attaches to an Intel Xeon processor-based host via a third-generation PCIe interface. The coprocessor runs on a standard Linux OS. It can be used as an extension to the host (so the host can offload the workload) or as an independent compute node. The first step to bring an Intel Xeon Phi coprocessor x200 into service is to install the Intel® Manycore Platform Software Stack (Intel® MPSS) 4.x on the host, which is available here. The Intel MPSS is a collection of software including device drivers, coprocessor management utilities, and the Linux OS for the coprocessor.

Intel Xeon Phi coprocessor x100 architecture: the Intel Xeon Phi coprocessor x100 is the first-generation of the Intel Xeon Phi product family. The coprocessor attaches to an Intel Xeon processor-based host via a second-generation PCIe interface. It runs on an OS separate from the host and has the following architecture (see Figure 2):

  • Intel® Initial Many Core Instructions
  • Up to 61 cores with high-bandwidth, bidirectional ring interconnect architecture
  • Each core has a 512-bit wide VPU and four hardware threads
  • Each core has a private 512-KB L2 cache
  • 16 GB GDDR5 memory
  • The second-generation PCIe is connected to the host


Figure 2. Intel® Xeon Phi™ processor x100 architecture.

To bring the Intel Xeon Phi coprocessor x100 into service, you must install the Intel MPSS 3.x on the host, which can be downloaded here.

MPI Programming Models

The Intel MPI Library supports the following MPI programming models (see Figure 3):

  • Host-only model (Intel Xeon processor or Intel Xeon Phi processor): In this mode, all MPI ranks reside and execute the workload on the host CPU only (or Intel Xeon Phi processor only).
  • Offload model: In this mode, the MPI ranks reside solely on the Intel Xeon processor host. The MPI ranks use offload capabilities of the Intel® C/C++ Compiler or Intel® Fortran Compiler to offload some workloads to the coprocessors. Typically, one MPI rank is used per host, and the MPI rank offloads to the coprocessor(s).
  • Coprocessor-only model: In this native mode, the MPI ranks reside solely inside the coprocessor. The application can be launched from the coprocessor.
  • Symmetric model: In this mode, the MPI ranks reside on the host and the coprocessors. The application can be launched from the host.


Figure 3. MPI programming models.
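
To make the offload model listed above more concrete, here is a minimal sketch (not one of this document's official samples) that uses the pragma-based offload support of the Intel C/C++ Compiler; the array name, its length, and the choice of coprocessor 0 are illustrative assumptions:

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

#define N 1000000

int main(int argc, char **argv)
{
   int rank;
   float *a = (float *)malloc(N * sizeof(float));

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);

   // The host-side MPI rank offloads this loop to coprocessor 0; the data in
   // 'a' is copied to the coprocessor and the results are copied back.
   #pragma offload target(mic:0) inout(a : length(N))
   {
      int i;
      for (i = 0; i < N; i++)
         a[i] = i * 0.5f;
   }

   printf("rank %d: offloaded loop done, a[10] = %f\n", rank, a[10]);

   free(a);
   MPI_Finalize();
   return 0;
}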

Using the Intel® MPI Library

This section shows how to build and run an MPI application in the following configurations: on an Intel Xeon Phi processor x200, on a system with one or more Intel Xeon Phi coprocessor x200, and on a system with one or more Intel Xeon Phi coprocessor x100 (see Figure 4).


Figure 4. Different configurations: (a) standalone Intel® Xeon Phi™ processor x200, (b) Intel Xeon Phi coprocessor x200 connected to a system with an Intel® Xeon® processor, and (c) Intel® Xeon Phi™ coprocessor x100 connected to a system with an Intel Xeon processor.

Installing the Intel® MPI Library

The Intel MPI Library is packaged as a standalone product or as a part of the Intel® Parallel Studio XE Cluster Edition.

By default, the Intel MPI Library will be installed in the path /opt/intel/impi on the host or the Intel Xeon Phi processor. To start, follow the appropriate directions to install the latest versions of the Intel C/C++ Compiler and the Intel Fortran Compiler.

You can purchase or try the free 30-day evaluation of the Intel Parallel Studio XE from https://software.intel.com/en-us/intel-parallel-studio-xe. These instructions assume that you have the Intel MPI Library tar file - l_mpi_<version>.<package_num>.tgz. This is the latest stable release of the library at the time of writing this article. To check if a newer version exists, log into the Intel® Registration Center. The instructions below are valid for all current and subsequent releases.

As root user, untar the tar file l_mpi_<version>.<package_num>.tgz:

# tar -xzvf l_mpi_<version>.<package_num>.tgz
# cd l_mpi_<version>.<package_num>

Execute the install script on the host and follow the instructions. The installation will be placed in the default installation directory /opt/intel/impi/<version>.<package_num> assuming you are installing the library with root permission.

# ./install.sh

Compiling an MPI program

To compile an MPI program on the host or on an Intel Xeon Phi processor x200:

Before compiling an MPI program, you need to establish the proper environment settings for the compiler and for the Intel MPI Library:

$ source /opt/intel/compilers_and_libraries_<version>/linux/bin/compilervars.sh intel64
$ source /opt/intel/impi/<version>.<package_num>/bin64/mpivars.sh

or if you installed the Intel® Parallel Studio XE Cluster Edition, you can simply source the configuration script:

$ source /opt/intel/parallel_studio_xe_<version>/psxevars.sh intel64

Compile and link your MPI program using an appropriate compiler command:

To compile and link with the Intel MPI Library, use the appropriate commands from Table 2.

Table 2. MPI compilation Linux* command.

  • C: mpiicc
  • C++: mpiicpc
  • Fortran 77 / 95: mpiifort

For example, to compile the C program for the host, you can use the wrapper mpiicc:

$ mpiicc ./myprogram.c -o myprogram

To compile the program for the Intel Xeon Phi processor x200 and the Intel Xeon Phi coprocessor x200, add the flag -xMIC-AVX512 to take advantage of the Intel AVX-512 instruction set architecture (ISA) available on this architecture. For example, the following command compiles a C program for the Intel Xeon Phi product family x200 using the Intel AVX-512 ISA:

$ mpiicc -xMIC-AVX512 ./myprogram.c -o myprogram.knl

To compile the program for the Intel Xeon Phi coprocessor x100, add the flag -mmic. The following command shows how to compile a C program for the Intel Xeon Phi coprocessor x100:

$ mpiicc -mmic ./myprogram.c -o myprogram.knc

Running an MPI program on the Intel Xeon Phi processor x200

To run the application on the Intel Xeon Phi processor x200, use the script mpirun:

$ mpirun -n <# of processes> ./myprogram.knl

where n is the number of MPI processes to launch on the processor.

Running an MPI program on the Intel Xeon Phi coprocessor x200 and Intel Xeon Phi coprocessor x100

To run an application on the coprocessors, the following steps are needed:

  • Start the MPSS service if it was stopped previously:

    $ sudo systemctl start mpss

  • Transfer the MPI executable from the host to the coprocessor. For example, use the scp utility to transfer the executable (for the Intel Xeon Phi coprocessor x100) to the coprocessor named mic0:

    $ scp myprogram.knc mic0:~/myprogram.knc

  • Transfer the MPI libraries and compiler libraries to the coprocessors. Before the first run of an MPI application on the Intel Xeon Phi coprocessors, copy the appropriate MPI and compiler libraries to each coprocessor equipped on the system: for the coprocessor x200, the libraries found under the lib64 directory are transferred; for the coprocessor x100, the libraries found under the mic directory are transferred.

For example, we issue the copy to the first coprocessor x100, called mic0, which is accessible via the IP address 172.31.1.1. Note that each coprocessor has a unique IP address, since the coprocessors are treated as just other uniquely addressable machines; you can refer to the first coprocessor as mic0 or by its IP address.

# sudo scp /opt/intel/impi/2017.3.196/mic/bin/* mic0:/bin/
# sudo scp /opt/intel/impi/2017.3.196/mic/lib/* mic0:/lib64/
# sudo scp /opt/intel/composer_xe_2017.3.196/compiler/lib/mic/* mic0:/lib64/

Instead of copying the MPI and compiler libraries manually, you can also run the script shown below to transfer them to the two coprocessors mic0 and mic1:

#!/bin/sh

export COPROCESSORS="mic0 mic1"
export BINDIR="/opt/intel/impi/2017.3.196/mic/bin"
export LIBDIR="/opt/intel/impi/2017.3.196/mic/lib"
export COMPILERLIB="/opt/intel/compilers_and_libraries_2017/linux/lib/mic"

for coprocessor in `echo $COPROCESSORS`
do
   for prog in mpiexec mpiexec.hydra pmi_proxy mpirun
   do
      sudo scp $BINDIR/$prog $coprocessor:/bin/$prog
   done

   for lib in libmpi.so.12 libmpifort.so.12 libmpicxx.so.12
   do
      sudo scp $LIBDIR/$lib $coprocessor:/lib64/$lib
   done

   for lib in libimf.so libsvml.so libintlc.so.5
   do
      sudo scp $COMPILERLIB/$lib $coprocessor:/lib64/$lib
   done
done

Script used for transferring MPI libraries to two coprocessors.

Another approach is to NFS mount the coprocessors’ file system from the host so that the coprocessors can have access to their MPI libraries from there. One advantage of using NFS mounts is that it saves RAM space on the coprocessors. The details on how to set up NFS mounts can be found in the first example in this document.

To run the application natively on the coprocessor, log in to the coprocessor and then run the mpirun script:

$ ssh mic0
$ mpirun -n <# of processes> ./myprogram.knc

where n is the number of MPI processes to launch on the coprocessor.

Finally, to run an MPI program from the host (symmetrically), additional steps are needed:

Set the Intel MPI environment variable I_MPI_MIC to let the Intel MPI Library recognize the coprocessors:

$ export I_MPI_MIC=enable

Disable the firewall in the host:

$ systemctl status firewalld
$ sudo systemctl stop firewalld

For multi-card use, configure Intel MPSS peer-to-peer so that each card can ping others:

$ sudo /sbin/sysctl -w net.ipv4.ip_forward=1

If you want to get debug information, include the flags -verbose and -genv I_MPI_DEBUG=n when running the application.
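
For example (the debug level and rank counts here are arbitrary, illustrative choices), a symmetric run with debug output might look like:

$ mpirun -verbose -genv I_MPI_DEBUG=5 -host localhost -n 2 ./myprogram : -host mic0 -n 2 ~/myprogram.knc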

The following sections include sample MPI programs written in C. The first example shows how to compile and run a program for Intel Xeon Phi processor x200 and for Intel Xeon Phi coprocessor x200. The second example shows how to compile and run a program for Intel Xeon Phi coprocessor x100.

Example 1

For illustration purposes, this example shows how to build and run an Intel MPI application in symmetric mode on a host that connects to two Intel Xeon Phi coprocessors x200. Note that the driver Intel MPSS 4.x should be installed on the host to enable the Intel Xeon Phi coprocessor x200.

In this example, use the integral representation below to calculate Pi (π):

π = ∫₀¹ 4 / (1 + x²) dx

Appendix A includes the implementation program. The workload is divided among the MPI ranks. Each rank spawns a team of OpenMP* threads, and each thread works on a chunk of the workload to take advantage of vectorization. First, compile and run this application on the Intel Xeon processor host. Since this program uses OpenMP, you need to compile the program with OpenMP libraries. Note that the Intel Parallel Studio XE 2018 is used in this example.
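
Concretely, the program in Appendix A approximates this integral with a midpoint-rule sum over NUM_STEP subintervals; each rank accumulates the terms of its contiguous index range, and the partial sums are combined with MPI_Reduce:

dx  = 1 / NUM_STEP
x_i = (i + 0.5) / NUM_STEP
π  ≈ sum over i = 0 .. NUM_STEP-1 of 4 * dx / (1 + x_i^2)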

Set the environment variables, compile the application for the host, and then generate the optimization report on vectorization and OpenMP:

$ source /opt/intel/compilers_and_libraries_2018/linux/bin/compilervars.sh intel64
$ mpiicc mpitest.c -qopenmp -O3 -qopt-report=5 -qopt-report-phase:vec,openmp -o mpitest

To run two ranks on the host:

$ mpirun -host localhost -n 2 ./mpitest
Hello world: rank 0 of 2 running on knl-lb0.jf.intel.com
Hello world: rank 1 of 2 running on knl-lb0.jf.intel.com
FROM RANK 1 - numthreads = 32
FROM RANK 0 - numthreads = 32

Elapsed time from rank 0:    8246.90 (usec)
Elapsed time from rank 1:    8423.09 (usec)
rank 0 pi=   3.141613006592

Next, compile the application for the Intel Xeon Phi coprocessor x200 and transfer the executable to the coprocessors mic0 and mic1 (assuming you have already set up passwordless SSH access to the coprocessors).

$ mpiicc mpitest.c -qopenmp -O3 -qopt-report=5 -qopt-report-phase:vec,openmp -xMIC-AVX512 -o mpitest.knl
$ scp mpitest.knl mic0:~/.
$ scp mpitest.knl mic1:~/.

Enable MPI for the coprocessors and disable the firewall in the host:

$ export I_MPI_MIC=enable
$ sudo systemctl stop firewalld

This example also shows how to mount a shared directory using the Network File System (NFS). As root, you mount the /opt/intel directory where the Intel C++ Compiler and the Intel MPI Library are installed. First, add descriptors in the /etc/exports configuration file on the host to share the directory /opt/intel with the coprocessors, whose IP addresses are 172.31.1.1 and 172.31.2.1, with read-only (ro) privilege.

[host~]# cat /etc/exports
/opt/intel 172.31.1.1(ro,async,no_root_squash)
/opt/intel 172.31.2.1(ro,async,no_root_squash)

Update the NFS export table and restart the NFS server in the host:

[host~]# exportfs -a
[host~]# service nfs restart

Next, log in on the coprocessors and create the mount point /opt/intel:

[host~]# ssh mic0
mic0:~# mkdir /opt
mic0:~# mkdir /opt/intel

 

Insert the descriptor “172.31.1.254:/opt/intel /opt/intel nfs defaults 1 1” into the /etc/fstab file on mic0:

mic0:~# cat /etc/fstab
/dev/root            /                    auto       defaults              1  1
proc                 /proc                proc       defaults              0  0
devpts               /dev/pts             devpts     mode=0620,gid=5       0  0
tmpfs                /run                 tmpfs      mode=0755,nodev,nosuid,strictatime 0  0
tmpfs                /var/volatile        tmpfs      defaults,size=85%     0  0
172.31.1.254:/opt/intel /opt/intel nfs defaults                            1  1

Finally, mount the shared directory /opt/intel on the coprocessor:

mic0:~# mount -a

Repeat this procedure for mic1 with this descriptor “172.31.2.254:/opt/intel /opt/intel nfs defaults 1 1” added to the /etc/fstab file in mic1.

Make sure that mic0 and mic1 are included in the /etc/hosts file:

$ cat /etc/hosts
127.0.0.1       localhost
::1             localhost
172.31.1.1      mic0
172.31.2.1      mic1

Now run the application from the host, with one rank on the host and one rank on each coprocessor:

$ mpirun -host localhost -n 1 ./mpitest : -host mic0 -n 1 ~/mpitest.knl : -host mic1 -n 1 ~/mpitest.knl
Hello world: rank 0 of 3 running on knl-lb0
Hello world: rank 1 of 3 running on mic0
Hello world: rank 2 of 3 running on mic1
FROM RANK 0 - numthreads = 64
FROM RANK 2 - numthreads = 272
FROM RANK 1 - numthreads = 272
Elapsed time from rank 0:   12114.05 (usec)
Elapsed time from rank 1:  136089.09 (usec)
Elapsed time from rank 2:  125049.11 (usec)
rank 0 pi=   3.141597270966

By default, the maximum number of hardware threads available on each compute node is used. However, you can change this default behavior by adding the local -env option for that compute node. For example, to set the number of OpenMP threads on mic0 to 68 and use compact affinity, you can use the command:

$ mpirun -host localhost -n 1 ./mpitest : -host mic0 -n 1 -env OMP_NUM_THREADS=68 -env KMP_AFFINITY=compact ~/mpitest.knl : -host mic1 -n 1 ~/mpitest.knl
Hello world: rank 0 of 3 running on knl-lb0.jf.intel.com
Hello world: rank 1 of 3 running on mic0
Hello world: rank 2 of 3 running on mic1
FROM RANK 0 - numthreads = 64
FROM RANK 1 - numthreads = 68
FROM RANK 2 - numthreads = 272
Elapsed time from rank 0:   11068.11 (usec)
Elapsed time from rank 1:   57780.98 (usec)
Elapsed time from rank 2:  133417.13 (usec)
rank 0 pi=   3.141597270966

To simplify the launch process, define a file with all machine names, give all the executables the same name, and place them in a predefined directory. For example, here all executables are named mpitest and are located in the user home directories:

$ cat hosts_file
knl-lb0:1
mic0:2
mic1:2

$ mpirun -machinefile hosts_file -n 5 ~/mpitest
Hello world: rank 0 of 5 running on knl-lb0
Hello world: rank 1 of 5 running on mic0
Hello world: rank 2 of 5 running on mic0
Hello world: rank 3 of 5 running on mic1
Hello world: rank 4 of 5 running on mic1
FROM RANK 0 - numthreads = 64
FROM RANK 1 - numthreads = 136
FROM RANK 3 - numthreads = 136
FROM RANK 2 - numthreads = 136
FROM RANK 4 - numthreads = 136
Elapsed time from rank 0:   11260.03 (usec)
Elapsed time from rank 1:   71480.04 (usec)
Elapsed time from rank 2:   69352.15 (usec)
Elapsed time from rank 3:   74187.99 (usec)
Elapsed time from rank 4:   67718.98 (usec)
rank 0 pi=   3.141598224640

 

Example 2

Example 2 shows how to build and run an MPI application in the symmetric model on a host that connects to two Intel Xeon Phi coprocessors x100. Note that the driver Intel MPSS 3.x should be installed on the host for the Intel Xeon Phi coprocessor x100.

The sample program estimates the calculation of Pi (π) using a Monte Carlo method. Consider a sphere centered at the origin and circumscribed by a cube. The sphere’s radius is r and the cube edge length is 2r. The volumes of a sphere and a cube are given by

V_sphere = (4/3) π r³        V_cube = (2r)³ = 8 r³

The first octant of the coordinate system contains one eighth of the volumes of both the sphere and the cube; the volumes in that octant are given by:

V_sphere / 8 = (π/6) r³        V_cube / 8 = r³

If we generate Nc points uniformly and randomly in the cube within this octant, we expect that about Ns points will be inside the sphere’s volume according to the following ratio:

Ns / Nc ≈ (V_sphere / 8) / (V_cube / 8) = ((π/6) r³) / r³ = π / 6

Therefore, the estimated Pi (π) is calculated by

π ≈ 6 Ns / Nc

where Nc is the number of points generated in the portion of the cube residing in the first octant, and Ns is the total number of points found inside the portion of the sphere residing in the first octant.

In the implementation, rank 0 (process) is responsible for dividing the work among the other n ranks. Each rank is assigned a chunk of work, and the summation is used to estimate the number Pi. Rank 0 divides the x-axis into n equal segments. Each rank generates (Nc /n) points in the assigned segment, and then computes the number of points in the first octant of the sphere (see Figure 5).


Figure 5. Each MPI rank handles a different portion in the first octant.

The pseudo code is shown below:

Rank 0 generates n random seeds
Rank 0 broadcasts the seeds to all n ranks
For each rank i in [0, n-1]:
    receive the corresponding seed
    set num_inside = 0
    for j = 0 to Nc/n:
        generate a point with coordinates
            x in [i/n, (i+1)/n]
            y in [0, 1]
            z in [0, 1]
        compute the distance d = x^2 + y^2 + z^2
        if d <= 1, increment num_inside
    send num_inside back to rank 0
Rank 0 sets Ns to the sum of all num_inside values
Rank 0 computes Pi = 6 * Ns / Nc

In order to build the application montecarlo.knc for the Intel Xeon Phi coprocessor x100, the Intel C++ Compiler 2017 is used. Appendix B includes the implementation program. Note that this example simply shows how to run the code on an Intel Xeon Phi coprocessor x100; you can optimize the sample code for further improvement.

$ source /opt/intel/compilers_and_libraries_2017/linux/bin/compilervars.sh intel64
$ mpiicc -mmic montecarlo.c -o montecarlo.knc

Build the application for the host:

$ mpiicc montecarlo.c -o montecarlo

Transfer the application montecarlo.knc to the /tmp directory on the coprocessors using the scp utility. In this example, we issue the copy to two Intel Xeon Phi coprocessors x100.

$ scp ./montecarlo.knc mic0:/tmp/montecarlo.knc
montecarlo.knc     100% 17KB 16.9KB/s 00:00
$ scp ./montecarlo.knc mic1:/tmp/montecarlo.knc
montecarlo.knc     100% 17KB 16.9KB/s 00:00

Transfer the MPI libraries and compiler libraries to the coprocessors using the script shown earlier. Enable MPI communication between the host and the Intel Xeon Phi coprocessors x100:

$ export I_MPI_MIC=enable

Run the mpirun script to start the application. The flag -n specifies the number of MPI processes and the flag -host specifies the machine name:

$ mpirun -n <# of processes> -host <hostname> <application>

We can run the application on multiple hosts by separating them with “:”. The first MPI rank (rank 0) always starts on the first part of the command:

$ mpirun -n <# of processes> -host <hostname1> <application> : -n <# of processes> -host <hostname2> <application>

This starts rank 0 on hostname1 and the other ranks on hostname2.

Now run the application on the host. The mpirun command shown below starts the application with 2 ranks on the host, 3 ranks on the coprocessor mic0, and 5 ranks on coprocessor mic1:

$ mpirun -n 2 -host localhost ./montecarlo : -n 3 -host mic0 /tmp/montecarlo.knc \
: -n 5 -host mic1 /tmp/montecarlo.knc

Hello world: rank 0 of 10 running on knc0
Hello world: rank 1 of 10 running on knc0
Hello world: rank 2 of 10 running on knc0-mic0
Hello world: rank 3 of 10 running on knc0-mic0
Hello world: rank 4 of 10 running on knc0-mic0
Hello world: rank 5 of 10 running on knc0-mic1
Hello world: rank 6 of 10 running on knc0-mic1
Hello world: rank 7 of 10 running on knc0-mic1
Hello world: rank 8 of 10 running on knc0-mic1
Hello world: rank 9 of 10 running on knc0-mic1
Elapsed time from rank 0:      13.87 (sec)
Elapsed time from rank 1:      14.01 (sec)
Elapsed time from rank 2:     195.16 (sec)
Elapsed time from rank 3:     195.17 (sec)
Elapsed time from rank 4:     195.39 (sec)
Elapsed time from rank 5:     195.07 (sec)
Elapsed time from rank 6:     194.98 (sec)
Elapsed time from rank 7:     223.32 (sec)
Elapsed time from rank 8:     194.22 (sec)
Elapsed time from rank 9:     193.70 (sec)
Out of 4294967295 points, there are 2248849344 points inside the sphere => pi=  3.141606330872

A shorthand way of doing this in symmetric mode is to use the -machinefile option for the mpirun command in coordination with the I_MPI_MIC_POSTFIX environment variable. In this case, make sure all executables are in the same location on the host and on the mic0 and mic1 cards.

The I_MPI_MIC_POSTFIX environment variable simply tells the library to add the .knc postfix when running on the cards (since the executables there are called montecarlo.knc).

$ export I_MPI_MIC_POSTFIX=.knc

Now set the rank mapping in your hosts file (by using the <host>:<#_ranks> format):

$ cat hosts_file
localhost:2
mic0:3
mic1:5

And run your executable:

$ mpirun -machinefile hosts_file /tmp/montecarlo

The nice thing about this syntax is that you only have to edit the hosts_file when deciding to change your number of ranks or need to add more cards.

As an alternative, you can ssh to a coprocessor and launch the application from there:

$ ssh mic0
$ mpirun -n 3 /tmp/montecarlo.knc
Hello world: rank 0 of 3 running on knc0-mic0
Hello world: rank 1 of 3 running on knc0-mic0
Hello world: rank 2 of 3 running on knc0-mic0
Elapsed time from rank 0:     650.47 (sec)
Elapsed time from rank 1:     650.61 (sec)
Elapsed time from rank 2:     648.01 (sec)
Out of 4294967295 points, there are 2248795855 points inside the sphere => pi=  3.141531467438

 

Summary

This document showed you how to compile and run simple MPI applications using the symmetric model. In a heterogeneous computing system, the performance of each computational unit is different, and this behavior leads to load imbalance. The Intel® Trace Analyzer and Collector can be used to analyze and understand the behavior of a complex MPI program running on a heterogeneous system. Using the Intel Trace Analyzer and Collector, you can quickly identify bottlenecks, evaluate load balancing, analyze performance, and identify communication hotspots. This powerful tool is essential for debugging and improving the performance of an MPI program running on a cluster with multiple computational units. For more details on using the Intel Trace Analyzer and Collector, read the whitepaper “Understanding MPI Load Imbalance with Intel® Trace Analyzer and Collector” available on /mic-developer. For more details, tips and tricks, and known workarounds, visit our Intel® Cluster Tools and the Intel® Xeon Phi™ Coprocessors page.

References

Appendix A

The code of the first sample program is shown below.

/*
 *  Copyright (c) 2017 Intel Corporation. All Rights Reserved.
 *
 *  Portions of the source code contained or described herein and all documents related
 *  to portions of the source code ("Material") are owned by Intel Corporation or its
 *  suppliers or licensors.  Title to the Material remains with Intel
 *  Corporation or its suppliers and licensors.  The Material contains trade
 *  secrets and proprietary and confidential information of Intel or its
 *  suppliers and licensors.  The Material is protected by worldwide copyright
 *  and trade secret laws and treaty provisions.  No part of the Material may
 *  be used, copied, reproduced, modified, published, uploaded, posted,
 *  transmitted, distributed, or disclosed in any way without Intel's prior
 *  express written permission.
 *
 *  No license under any patent, copyright, trade secret or other intellectual
 *  property right is granted to or conferred upon you by disclosure or
 *  delivery of the Materials, either expressly, by implication, inducement,
 *  estoppel or otherwise. Any license under such intellectual property rights
 *  must be express and approved by Intel in writing.
 */
//******************************************************************************
// Content: (version 1.0)
//      Calculate the number PI using its integral representation.
//
//******************************************************************************
#include <stdio.h>
#include <omp.h>   // needed for omp_get_num_threads()
#include "mpi.h"

#define MASTER 0
#define TAG_HELLO 1
#define TAG_TIME 2

const long ITER = 1024 * 1024;
const long SCALE = 16;
const long NUM_STEP = ITER * SCALE;

float calculate_partialPI(int n, int num) {
   unsigned long i;
   int  numthreads;
   float x, dx, pi = 0.0f;

   #pragma omp parallel
   #pragma omp master
   {
      numthreads = omp_get_num_threads();
      printf("FROM RANK %d - numthreads = %d\n", n, numthreads);
   }

   dx = 1.0 / NUM_STEP;

   unsigned long NUM_STEP1 = NUM_STEP / num;
   unsigned long begin = n * NUM_STEP1;
   unsigned long end = (n + 1) * NUM_STEP1;
   #pragma omp parallel for reduction(+:pi)
   for (i = begin; i < end; i++)
   {
      x = (i + 0.5f) / NUM_STEP;
      pi += (4.0f * dx) / (1.0f + x*x);
   }

   return pi;
}

int main(int argc, char **argv)
{
   float pi1, total_pi;
   double startprocess;
   int i, id, remote_id, num_procs, namelen;
   char name[MPI_MAX_PROCESSOR_NAME];
   MPI_Status stat;

   if (MPI_Init (&argc, &argv) != MPI_SUCCESS)
   {
      printf ("Failed to initialize MPI\n");
      return (-1);
   }

   // Create the communicator, and retrieve the number of processes.
   MPI_Comm_size (MPI_COMM_WORLD, &num_procs);

   // Determine the rank of the process.
   MPI_Comm_rank (MPI_COMM_WORLD, &id);

   // Get machine name
   MPI_Get_processor_name (name, &namelen);

   if (id == MASTER)
   {
      printf ("Hello world: rank %d of %d running on %s\n", id, num_procs, name);

      for (i = 1; i<num_procs; i++)
      {
         MPI_Recv (&remote_id, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);
         MPI_Recv (&num_procs, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);
         MPI_Recv (&namelen, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);
         MPI_Recv (name, namelen+1, MPI_CHAR, i, TAG_HELLO, MPI_COMM_WORLD, &stat);

         printf ("Hello world: rank %d of %d running on %s\n", remote_id, num_procs, name);
      }
   }
   else
   {
      MPI_Send (&id, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (&num_procs, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (&namelen, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (name, namelen+1, MPI_CHAR, MASTER, TAG_HELLO, MPI_COMM_WORLD);
   }

   startprocess = MPI_Wtime();

   pi1 = calculate_partialPI(id, num_procs);

   double elapsed = MPI_Wtime() - startprocess;

   MPI_Reduce (&pi1, &total_pi, 1, MPI_FLOAT, MPI_SUM, MASTER, MPI_COMM_WORLD);
   if (id == MASTER)
   {
      double timeprocess[num_procs];

      timeprocess[MASTER] = elapsed;
      printf("Elapsed time from rank %d: %10.2f (usec)\n", MASTER, 1000000 * timeprocess[MASTER]);

      for (i = 1; i < num_procs; i++)
      {
         // Rank 0 waits for elapsed time value
         MPI_Recv (&timeprocess[i], 1, MPI_DOUBLE, i, TAG_TIME, MPI_COMM_WORLD, &stat);
         printf("Elapsed time from rank %d: %10.2f (usec)\n", i, 1000000 *timeprocess[i]);
      }

      printf("rank %d pi= %16.12f\n", id, total_pi);
   }
   else
   {
      // Send back the processing time (in second)
      MPI_Send (&elapsed, 1, MPI_DOUBLE, MASTER, TAG_TIME, MPI_COMM_WORLD);
   }

   // Terminate MPI.
   MPI_Finalize();
   return 0;
}

 

Appendix B

The code of the second sample program is shown below.

/*
 *  Copyright (c) 2017 Intel Corporation. All Rights Reserved.
 *
 *  Portions of the source code contained or described herein and all documents related
 *  to portions of the source code ("Material") are owned by Intel Corporation or its
 *  suppliers or licensors.  Title to the Material remains with Intel
 *  Corporation or its suppliers and licensors.  The Material contains trade
 *  secrets and proprietary and confidential information of Intel or its
 *  suppliers and licensors.  The Material is protected by worldwide copyright
 *  and trade secret laws and treaty provisions.  No part of the Material may
 *  be used, copied, reproduced, modified, published, uploaded, posted,
 *  transmitted, distributed, or disclosed in any way without Intel's prior
 *  express written permission.
 *
 *  No license under any patent, copyright, trade secret or other intellectual
 *  property right is granted to or conferred upon you by disclosure or
 *  delivery of the Materials, either expressly, by implication, inducement,
 *  estoppel or otherwise. Any license under such intellectual property rights
 *  must be express and approved by Intel in writing.
 */
//******************************************************************************
// Content: (version 0.5)
//      Based on a Monte Carlo method, this MPI sample code uses volumes to
//      estimate the number PI.
//
//******************************************************************************
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <time.h>

#include "mpi.h"

#define MASTER 0
#define TAG_HELLO 4
#define TAG_TEST 5
#define TAG_TIME 6

int main(int argc, char *argv[])
{
  int i, id, remote_id, num_procs;

  MPI_Status stat;
  int namelen;
  char name[MPI_MAX_PROCESSOR_NAME];

  // Start MPI.
  if (MPI_Init (&argc, &argv) != MPI_SUCCESS)
    {
      printf ("Failed to initialize MPI\n");
      return (-1);
    }

  // Create the communicator, and retrieve the number of processes.
  MPI_Comm_size (MPI_COMM_WORLD, &num_procs);

  // Determine the rank of the process.
  MPI_Comm_rank (MPI_COMM_WORLD, &id);
    // Get machine name
  MPI_Get_processor_name (name, &namelen);

  if (id == MASTER)
    {
      printf ("Hello world: rank %d of %d running on %s\n", id, num_procs, name);

      for (i = 1; i<num_procs; i++)
	{
	  MPI_Recv (&remote_id, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);
	  MPI_Recv (&num_procs, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);
	  MPI_Recv (&namelen, 1, MPI_INT, i, TAG_HELLO, MPI_COMM_WORLD, &stat);
	  MPI_Recv (name, namelen+1, MPI_CHAR, i, TAG_HELLO, MPI_COMM_WORLD, &stat);

	  printf ("Hello world: rank %d of %d running on %s\n", remote_id, num_procs, name);
	}
    }
  else
    {
      MPI_Send (&id, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (&num_procs, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (&namelen, 1, MPI_INT, MASTER, TAG_HELLO, MPI_COMM_WORLD);
      MPI_Send (name, namelen+1, MPI_CHAR, MASTER, TAG_HELLO, MPI_COMM_WORLD);
    }

  // Rank 0 distributes seeds randomly to all processes.
  double startprocess, endprocess;

  int distributed_seed = 0;
  int *buff;

  buff = (int *)malloc(num_procs * sizeof(int));

  unsigned int MAX_NUM_POINTS = pow (2,32) - 1;
  unsigned int num_local_points = MAX_NUM_POINTS / num_procs;

  if (id == MASTER)
    {
      srand (time(NULL));

      for (i=0; i<num_procs; i++)
	{
	  distributed_seed = rand();
	  buff[i] = distributed_seed;
	}
    }

  // Broadcast the seed to all processes
  MPI_Bcast(buff, num_procs, MPI_INT, MASTER, MPI_COMM_WORLD);

  // At this point, every process (including rank 0) has a different seed. Using its seed,
  // each process generates num_local_points points randomly, with x in its assigned
  // segment [id/n, (id+1)/n] and y, z in [0, 1].
  startprocess = MPI_Wtime();

  srand (buff[id]);

  unsigned int point = 0;
  unsigned int rand_MAX = 128000;
  float p_x, p_y, p_z;
  float temp, temp2, pi;
  double result;
  unsigned int inside = 0, total_inside = 0;
    for (point=0; point<num_local_points; point++)
    {
      temp = (rand() % (rand_MAX+1));
      p_x = temp / rand_MAX;
      p_x = p_x / num_procs;

      temp2 = (float)id / num_procs;	// id belongs to 0, num_procs-1
      p_x += temp2;

      temp = (rand() % (rand_MAX+1));
      p_y = temp / rand_MAX;

      temp = (rand() % (rand_MAX+1));
      p_z = temp / rand_MAX;

      // Compute the number of points residing inside of the 1/8 of the sphere
      result = p_x * p_x + p_y * p_y + p_z * p_z;

      if (result <= 1)
	  {
		inside++;
	  }
    }

  double elapsed = MPI_Wtime() - startprocess;

  MPI_Reduce (&inside, &total_inside, 1, MPI_UNSIGNED, MPI_SUM, MASTER, MPI_COMM_WORLD);

#if DEBUG
  printf ("rank %d counts %u points inside the sphere\n", id, inside);
#endif

  if (id == MASTER)
    {
      double timeprocess[num_procs];

      timeprocess[MASTER] = elapsed;
      printf("Elapsed time from rank %d: %10.2f (sec) \n", MASTER, timeprocess[MASTER]);

      for (i=1; i<num_procs; i++)
	{
	  // Rank 0 waits for elapsed time value
	  MPI_Recv (&timeprocess[i], 1, MPI_DOUBLE, i, TAG_TIME, MPI_COMM_WORLD, &stat);
	  printf("Elapsed time from rank %d: %10.2f (sec) \n", i, timeprocess[i]);
	}

      temp = 6 * (float)total_inside;
      pi = temp / MAX_NUM_POINTS;
      printf ( "Out of %u points, there are %u points inside the sphere => pi=%16.12f\n", MAX_NUM_POINTS, total_inside, pi);
    }
  else
    {
      // Send back the processing time (in second)
      MPI_Send (&elapsed, 1, MPI_DOUBLE, MASTER, TAG_TIME, MPI_COMM_WORLD);
    }

  free(buff);

  // Terminate MPI.
  MPI_Finalize();

  return 0;
}

Intel® Manycore Platform Software Stack Archive for the Intel® Xeon Phi™ Coprocessor x200 Product Family


On this page you will find the past releases of the Intel® Manycore Platform Software Stack (Intel® MPSS) for the Intel® Xeon Phi™ coprocessor x200 product family. The most recent release is found here: https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-for-intel-xeon-phi-coprocessor-x200. We recommend customers use the latest release wherever possible.

  • N-1 release for Intel® MPSS 4.4.x

Intel MPSS 4.4.0 HotFix 1 release for Linux*

Intel MPSS 4.4.0 Hotfix 1 (released: May 8, 2017), downloads available:

  • RHEL 7.3: 214MB, MD5 checksum 8a015c38379b8be42c8045d3ceb44545
  • RHEL 7.2: 214MB, MD5 checksum 694b7b908c12061543d2982695985d8b
  • SLES 12.2: 213MB, MD5 checksum 506ab12af774f78fa8e107fd7a4f96fd
  • SLES 12.1: 213MB, MD5 checksum b8520888954e846e8ac8604d62a9ba96
  • SLES 12.0: 213MB, MD5 checksum 88a3a4415afae1238453ced7a0df28ea
  • Card installer file (mpss-4.4.0-card.tar): 761MB, MD5 checksum d26e26868297cea5fd4ffafe8d78b66e
  • Source file (mpss-4.4.0-card-source.tar): 514MB, MD5 checksum 127713d06496090821b5bb3613c95b30

Documentation:

  • releaseNotes-linux.txt: Release notes (English), last updated May 2017, approx. 15KB
  • readme.txt: Readme (includes installation instructions) for Linux (English), last updated May 2017, approx. 17KB
  • mpss_user_guide.pdf: Intel MPSS user guide, last updated May 2017, approx. 3MB
  • eula.txt: End User License Agreement (Important: Read before downloading, installing, or using), last updated May 2017, approx. 33KB

 

Intel MPSS 4.4.0 HotFix 1 release for Windows*

Intel MPSS 4.4.0 Hotfix 1 (released: May 8, 2017), download available:

  • mpss-4.4.0-windows.zip: 1091MB, MD5 checksum 204a65b36858842f472a37c77129eb53

Documentation:

  • releasenotes-windows.txt: Release notes (English), last updated May 2017, approx. 7KB
  • readme-windows.pdf: Readme for Windows (English), last updated May 2017, approx. 399KB
  • mpss_users_guide_windows: Intel MPSS user guide for Windows, last updated May 2017, approx. 3MB
  • eula.txt: End User License Agreement (Important: Read before downloading, installing, or using), last updated May 2017, approx. 33KB

 

The discussion forum at http://software.intel.com/en-us/forums/intel-many-integrated-core is available to join and discuss any enhancements or issues with the Intel MPSS.

Thread pool behavior for Apollo Lake Intel® SDK for OpenCL™ applications


This article describes internal driver optimizations for developers using Intel Atom™ processors, Celeron™ processors, and Pentium™ processors based on the "Apollo Lake" platform (Broxton Graphics). The intent is to clarify existing documentation. The optimizations described are completely transparent; the only change needed from a developer's perspective is awareness that, for this special case, applications should be designed for the thread pool configuration instead of the underlying hardware.

Driver thread pool optimizations maximize Apollo Lake EU performance 

For  Intel® Core™ and Intel® Xeon® processors with integrated graphics, the number of EUs and EUs per subslice is large enough that mapping thread pools directly to subslices is efficient.  Tying thread pool implementation to hardware means that application behavior and hardware details can be described together in a way that is easy to visualize and remember.  This approach was used by many reference documents such as The Compute Architecture of Intel Processor Graphics Gen9.  

However, for the relatively smaller GPUs in the embedded processors listed above, this approach could sometimes result in non-optimal mapping. For these processors, EUs are now pooled across subslices, creating "virtual subslices" that do not match the hardware. In this case it helps to understand where behavior is driven by thread pools instead of the hardware layout.

Thread pools and physical subslices for HD graphics 505

There are two GPU configurations for Apollo Lake:

  • Intel® HD Graphics 505: 18 EUs, 3x6 physical, now using 2x9 thread pools (shown above)
  • Intel® HD Graphics 500: 12 EUs, 2x6 physical, now using 1x12 thread pools (not shown)

The thread pools, not the physical hardware, determine how you should write your application. For example, if you have HD Graphics 505, your application should be written as if there were two subslices with nine EUs each, not three subslices with six EUs. The short host-code sketch below shows one way to read these values at run time.
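
A minimal host-code sketch (standard OpenCL 1.2 API calls; error checking omitted for brevity) that reads these values from the driver instead of hard-coding the physical layout:

#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
   cl_platform_id platform;
   cl_device_id device;
   cl_uint compute_units;
   size_t max_wg_size;
   cl_ulong local_mem_size;

   // Pick the first platform and its first GPU device.
   clGetPlatformIDs(1, &platform, NULL);
   clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

   // These values reflect the configuration exposed by the driver.
   clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                   sizeof(compute_units), &compute_units, NULL);
   clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                   sizeof(max_wg_size), &max_wg_size, NULL);
   clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                   sizeof(local_mem_size), &local_mem_size, NULL);

   printf("compute units: %u, max work-group size: %zu, local memory: %llu bytes\n",
          compute_units, max_wg_size, (unsigned long long)local_mem_size);
   return 0;
}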

Extensive testing showed that, in the worst case, performance simply matched the legacy configuration, while the boost from switching to 2x9/1x12 often approaches 2X. Since no scenarios were found that benefit from the legacy configuration, there are no plans to add extensions to modify MEDIA_POOL_STATE.

 

Thread pool size vs. physical hardware configuration

There are 4 main areas to consider:

  • Optimal work group size is determined by thread pool configuration, not physical hardware.  The driver will automatically handle the thread launch details to maximize thread occupancy.  State tracking (such as branch masking) is handled at the pool and subslice level by the driver, but for the most part these are implementation details which can be ignored by applications.
  • Local memory is shared by threads in the same pool.  The  number of bytes reported by CL_DEVICE_LOCAL_MEM_SIZE is physically located in lowest level cache, not the subslice.  For Apollo Lake this means either 1 (for 1x12 HD Graphics 500) or 2 (for 2x9 HD Graphics 505) regions are reserved to be shared by all threads in the same workgroup.  

  • Workgroup Barriers: again, behavior is tied to the work group, which is defined by the thread pool. There are now two types of internal barrier implementations -- "local barriers" within a physical subslice and "linked barriers" spanning subslices.  This behavior happens automatically and cannot be changed by the application.  There are no additional knobs provided to optimize.  
  • Subgroup extensions:  subgroups are "between" work groups and work items, so their mapping to hardware remains unchanged.   Work items in a subgroup execute on the same EU thread.  For more info see Ben Ashbaugh's excellent section on subgroups in our extension tutorial.

 

Conclusion

In the past, thread pools were always configured to match the physical hardware. Now there is a notable exception, due to optimizations that increase performance on Apollo Lake processors. You won't need to make many changes to use these optimizations; the most important takeaway is that Intel has done the work behind the scenes to make efficient use of Apollo Lake capabilities easy. The details in this article are provided as conceptual background, but everything happens under the hood and the changes are completely transparent. To your application, HD Graphics 500 has 1 subslice with 12 EUs and HD Graphics 505 has 2 subslices with 9 EUs -- even though the underlying hardware is 2x6 and 3x6. Extensive internal testing has shown that this internal driver optimization provides significant improvements, and we have not yet seen a case of performance regression. However, we are always open to feedback: if you find a scenario where the legacy thread pool configuration may be a better fit, please let us know.

For more information, please see the Broxton Graphics Programmer's Reference Manual.


An Example of a Convolutional Neural Network for Image Super-Resolution


Convolutional neural networks (CNN) are becoming mainstream in computer vision. In particular, CNNs are widely used for high-level vision tasks, like image classification (AlexNet*, for example). This article (and associated tutorial) describes an example of a CNN for image super-resolution (SR), which is a low-level vision task, and its implementation using the Intel® Distribution for Caffe* framework and Intel® Distribution for Python*. This CNN is based on the work described in [1] and [2], which proposes a new approach to performing single-image SR using CNNs.

Introduction

Some modern camera sensors, present in everyday electronic devices like digital cameras, phones, and tablets, are able to produce reasonably high-resolution (HR) images and videos. The resolution in the images and videos produced by these devices is in many cases acceptable for general use.

However, there are situations where the image or video is considered low resolution (LR). Examples include the following situations:

  1. Device does not produce HR images or video (as in some surveillance systems).
  2. The objects of interest in the image or video are small compared to the size of the image or video frame; for example, faces of people or vehicle plates located far away from the camera.
  3. Blurred or noisy images.
  4. Application using the images or videos demands higher resolution than that present in the camera.
  5. Improving the resolution as a pre-processing step improves the performance of other algorithms that use the images; face detection, for example.

Super-resolution is a technique to obtain an HR image from one or several LR images. SR can be based on a single image or on several frames in a video sequence.

Single-image (or single-frame) SR uses pairs of LR and HR images to learn the mapping between them. For this purpose, image databases containing LR and HR pairs are created [3] and used as a training set. The learned mapping can be used to predict HR details in a new image.

On the other hand, multiple-frame SR is based on several images taken from the same scene, but from slightly different conditions (such as angle, illumination, and position). This technique uses the non-redundant information present in multiple images (or frames in an image sequence) to increase the SR performance.

In this article, we will focus on a single-image SR method.

Single-Image Super-Resolution Using Convolutional Neural Networks

In this method, a training set is used to train a neural network (NN) to learn the mapping between the LR and HR images in the training set. There are many references in the literature about SR; many different techniques have been proposed and used for about 30 years, and methods using deep CNNs have been developed in the last few years. One of the first such methods was created by the authors of [1], who described a three-layer CNN and named it Super-Resolution Convolutional Neural Network (SRCNN). Their pioneering work in this area is important because, besides demonstrating that the mapping from LR to HR can be cast as a CNN, they created a model often used as a reference; new methods compare their performance to the SRCNN results. The same authors have recently developed a modified version of their original SRCNN, which they named Fast Super-Resolution Convolutional Neural Network (FSRCNN), that offers better restoration quality and runs faster [2].

In this article, we describe both the SRCNN and the FSRCNN, and, in a separate tutorial, we show an implementation of the improved FSRCNN. Both the SRCNN and the FSRCNN can be used as a basis for further experimentation with other published network architectures, as well as others that the readers might want to try. Although the FSRCNN (and other recent network architectures for SR) show clear improvement over the SRCNN, the original SRCNN is also described here to show how this pioneer network has evolved from its inception to newer networks that use different topologies to achieve better results. In the tutorial, we will implement the FSRCNN network using the Intel Distribution for Caffe deep learning framework and Intel Distribution for Python, which will let us take advantage of Intel® Xeon® processors and Intel® Xeon Phi™ processors, as well as Intel® libraries to accelerate training and testing of this network.

Super-Resolution Convolutional Neural Network (SRCNN) Structure

The authors of the SRCNN describe their network, pointing out the equivalence of their method to the sparse-coding method [4], which is a widely used learning method for image SR. This is an important and educational aspect of their work, because it shows how example-based learning methods can be adapted and generalized to CNN models.

The SRCNN consists of the following operations [1]:

  1. Preprocessing: Up-scales LR image to desired HR size.
  2. Feature extraction: Extracts a set of feature maps from the up-scaled LR image.
  3. Non-linear mapping: Maps the feature maps representing LR to HR patches.
  4. Reconstruction: Produces the HR image from HR patches.

Operations 2–4 above can each be cast as a convolutional layer in a CNN that accepts as input the preprocessed images from step 1 above and outputs the HR image. The structure of this SRCNN consists of three convolutional layers:

  • Input Image: LR image up-sampled to desired higher resolution and c channels (the color components of the image)
  • Conv. Layer 1: Patch extraction
    • n1 filters of size c× f1× f1
    • Activation function: ReLU (rectified linear unit)
    • Output: n1 feature maps
    • Parameters to optimize: c× f1× f1× n1 weights and n1 biases
  • Conv. Layer 2: Non-linear mapping
    • n2 filters of size n1× f2× f2
    • Activation function: ReLU
    • Output: n2 feature maps
    • Parameters to optimize: n1× f2× f2× n2 weights and n2 biases
  • Conv. Layer 3: Reconstruction
    • One filter of size n2× f3× f3
    • Activation function: Identity
    • Output: HR image
    • Parameters to optimize: n2× f3× f3× c weights and c biases
  • Loss Function: Mean squared error (MSE) between the N reconstructed HR images and the N original true HR images in the training set (N is the number of images in the training set).
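
Written out (the symbols here are chosen only for illustration: Y_i is the i-th preprocessed LR input, X_i the corresponding ground-truth HR image, F(Y_i; Θ) the network output with parameters Θ, and N the number of training images), the loss minimized during training is:

L(Θ) = (1/N) * Σ_{i=1..N} || F(Y_i; Θ) - X_i ||^2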

In their paper, the authors of the SRCNN implement and test their network using several settings that vary the number of filters. They get better SR performance when they increase the number of filters, at the expense of increasing the number of parameters (weights and biases) to optimize, which in turn increases the computational cost. Next is their reference model, which shows good overall results in terms of accuracy and performance (Figure 1):

  • Input Image: LR single-channel image up-sampled to desired higher resolution
  • Conv. Layer 1: Patch extraction
    • 64 filters of size 1 x 9 x 9
    • Activation function: ReLU
    • Output: 64 feature maps
    • Parameters to optimize: 1 x 9 x 9 x 64 = 5184 weights and 64 biases
  • Conv. Layer 2: Non-linear mapping
    • 32 filters of size 64 x 1 x 1
    • Activation function: ReLU
    • Output: 32 feature maps
    • Parameters to optimize: 64 x 1 x 1 x 32 = 2048 weights and 32 biases
  • Conv. Layer 3: Reconstruction
    • 1 filter of size 32 x 5 x 5
    • Activation function: Identity
    • Output: HR image
    • Parameters to optimize: 32 x 5 x 5 x 1 = 800 weights and 1 bias

Figure 1. Structure of SRCNN showing parameters for reference model.

Fast Super-Resolution Convolutional Neural Network (FSRCNN) Structure

The authors of the SRCNN recently created a new CNN that accelerates the training and prediction tasks while achieving comparable or better performance than the SRCNN. The new FSRCNN consists of the following operations [2]:

  1. Feature extraction: Extracts a set of feature maps directly from the LR image.
  2. Shrinking: Reduces dimension of feature vectors (thus decreasing the number of parameters) by using a smaller number of filters (compared to the number of filters used for feature extraction).
  3. Non-linear mapping: Maps the feature maps representing LR to HR patches. This step is performed using several mapping layers with a filter size smaller than that used in the SRCNN.
  4. Expanding: Increases dimension of feature vectors. This operation performs the inverse operation as the shrinking layers, in order to more accurately produce the HR image.
  5. Deconvolution: Produces the HR image from HR features.

The authors explain in detail the differences between SRCNN and FSRCNN, but things particularly relevant for a quick implementation and experimentation (which is the scope of this article and the associated tutorial) are the following:

  1. FSRCNN uses multiple convolution layers for the non-linear mapping operation (instead of a single layer in SRCNN). The number of layers can be changed (compared to the author’s version) in order to experiment. Performance and accuracy of reconstruction will vary with those changes. Also, this is a good example for fine-tuning a CNN by keeping the portion of FSRCNN fixed up to the non-linear mapping layers, and then adding or changing those layers to experiment with different lengths for the non-linear LR-HR mapping operation.
  2. The input image is directly the LR image. It does not need to be up-sampled to the size of the expected HR image, as in the SRCNN. This is part of why this network is faster; the feature extraction stage uses a smaller number of parameters compared to the SRCNN.

As seen in Figure 2, the five operations shown above can be cast as a CNN using convolutional layers for operations 1–4, and a deconvolution layer for operation 5. Non-linearities are introduced via parametric rectified linear unit (PReLU) layers (described in [5]), which the authors chose for this particular model because of better and more stable performance compared to rectified linear unit (ReLU) layers. See Appendix 1 for a brief description of ReLUs and PReLUs.

Figure 2. Structure of FSRCNN(56, 12, 4).

The overall best performing model reported by the authors is the FSRCNN (56, 12, 4) (Figure 2), which refers to a network with a LR feature dimension of 56 (number of filters both in the first convolution and in the deconvolution layer), 12 shrinking filters (the number of filters in the layers in the middle of the network, performing the mapping operation), and a mapping depth of 4 (the number of convolutional layers that implement the mapping between the LR and the HR feature space). This is the reason why this network looks like an hourglass; it is thick (more parameters) at the edges and thin (fewer parameters) in the middle. The overall shape of this reference model is symmetrical and its structure is as follows:

  • Input Image: LR single channel.
  • Conv. Layer 1: Feature extraction
    • 56 filters of size 1 x 5 x 5
    • Activation function: PReLU
    • Output: 56 feature maps
    • Parameters: 1 x 5 x 5 x 56 = 1400 weights and 56 biases
  • Conv. Layer 2: Shrinking
    • 12 filters of size 56 x 1 x 1
    • Activation function: PReLU
    • Output: 12 feature maps
    • Parameters: 56 x 1 x 1 x 12 = 672 weights and 12 biases
  • Conv. Layers 3–6: Mapping
    • 4 x 12 filters of size 12 x 3 x 3
    • Activation function: PReLU
    • Output: HR feature maps
    • Parameters: 4 x 12 x 3 x 3 x 12 = 5184 weights and 48 biases
  • Conv. Layer 7: Expanding
    • 56 filters of size 12 x 1 x 1
    • Activation function: PReLU
    • Output: 56 feature maps
    • Parameters: 12 x 1 x 1 x 56 = 672 weights and 56 biases
  • DeConv Layer 8: Deconvolution
    • One filter of size 56 x 9 x 9
    • Activation function: PReLU
    • Output: HR image
    • Parameters: 56 x 9 x 9 x 1 = 4536 weights and 1 bias

Total number of weights: 12464 (plus a very small number of parameters in PReLU layers)
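
To double-check the arithmetic above, the following minimal Python sketch (not part of the original papers; it simply recomputes the counts listed in this section) tallies the weights and biases of FSRCNN(56, 12, 4):

# Each entry is (layer name, number of filters, input channels, kernel height, kernel width).
layers = [
    ("feature extraction", 56,  1, 5, 5),
    ("shrinking",          12, 56, 1, 1),
]
layers += [("mapping %d" % (i + 1), 12, 12, 3, 3) for i in range(4)]
layers += [
    ("expanding",     56, 12, 1, 1),
    ("deconvolution",  1, 56, 9, 9),
]

total_weights = 0
for name, filters, channels, kh, kw in layers:
    weights = filters * channels * kh * kw   # one kernel per filter, spanning all input channels
    biases = filters                         # one bias per filter
    total_weights += weights
    print("%-18s %5d weights, %3d biases" % (name, weights, biases))

print("Total weights: %d" % total_weights)   # prints 12464

The same bookkeeping applied to the SRCNN reference model above reproduces its 5184 + 2048 + 800 weights.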

Figure 3 shows an example of using the trained FSRCNN on one of the test images. The protobuf file describing this network, as well as training and testing data preparation and implementation details, will be covered in the associated tutorial.

Figure 3. An example of inference using a trained FSRCNN. The left image is the original. In the center, the original image was down-sampled and blurred. The image on the right is the reconstructed HR image using this network.

Summary

This article presented an overview of two recent CNNs for single-image super-resolution. The networks chosen are representative of the state of the art in SR and, as some of the first published CNN-based methods, offer interesting insights into how a non-CNN method (sparse coding) inspired a CNN-based approach. In the associated tutorial, an implementation of FSRCNN is shown using the Intel® Distribution for Caffe* framework and Intel® Distribution for Python*. This reference implementation can be used to experiment with variations of this network, is a good example for fine-tuning a network, and can serve as a base for implementing the newer super-resolution networks that have been published recently. These newer networks show improvements in reconstruction quality or training/inference speed, and some of them attempt to solve the multi-frame SR problem. The reader is encouraged to experiment with these new networks.

Appendix 1: Rectified Linear Units (Rectifiers)

Rectified activation units (rectifiers) in neural networks are one way to introduce non-linearities in the network. A non-linear layer (also called an activation layer) is necessary in an NN to prevent it from becoming a pure linear model with limited learning capabilities. Other possible activation layers are, among others, a sigmoid function or a hyperbolic tangent (tanh) layer. However, rectifiers have better computational efficiency, improving the overall training of the CNN.

The most commonly used rectifier is the traditional rectified linear unit (ReLU), which performs an operation defined mathematically as:

f(x_i) = max(0, x_i)

where x_i is the input on the i-th channel.

Another rectifier introduced recently [5] is the parametric rectified linear unit (PReLU), defined as:

f(x_i) = max(0, x_i) + p_i min(0, x_i)

which includes parameters p_i controlling the slope of the line representing the negative inputs. These parameters are learned jointly with the model during the training phase. To reduce the number of parameters, the p_i parameters can be collapsed into one learnable parameter shared by all channels.

A particular case of the PReLU is the leaky ReLU (LReLU), which is a PReLU with p_i defined as a small constant k for all input channels.
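
To make these definitions concrete, here is a minimal NumPy sketch of the three rectifiers (for illustration only; in Caffe they are provided as ready-made layers, as shown next):

import numpy as np

def relu(x):
    # ReLU: passes positive inputs unchanged and zeroes out negative inputs
    return np.maximum(0.0, x)

def prelu(x, p):
    # PReLU: positive inputs unchanged; negative inputs scaled by the slope p
    # (a single shared scalar here; in general one slope per channel)
    return np.where(x > 0, x, p * x)

def lrelu(x, k=0.1):
    # Leaky ReLU: a PReLU whose slope is a small fixed constant k
    return prelu(x, k)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))         # 0, 0, 0, 1.5
print(prelu(x, 0.25))  # -0.5, -0.125, 0, 1.5
print(lrelu(x))        # -0.2, -0.05, 0, 1.5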

In Caffe, a PReLU layer can be defined (in a protobuf file) as

layer {
 name: "reluXX"
 type: "PReLU"
 bottom: "convXX"
 top: "convXX"
 prelu_param {
  channel_shared: 1
 }
}

Where, in this case, the negative slopes are shared across channels. A different option is to initialize the slope to a fixed small value (0.1 here) using a constant filler; note that in Caffe the slope remains learnable unless the layer’s learning rate multiplier is set to zero:

layer {
 name: "reluXX"
 type: "PReLU"
 bottom: "convXX"
 top: "convXX"
 prelu_param {
  filler {
   type: "constant"
   value: 0.1
  }
 }
}

References

1. C. Dong, C. C. Loy, K. He and X. Tang, "Learning a Deep Convolutional Network for Image Super-Resolution," 2014.

2. C. Dong, C. C. Loy and X. Tang, "Accelerating the Super-Resolution Convolutional Neural Network," 2016.

3. P. B. Chopade and P. M. Patil, "Single and Multi Frame Image Super-Resolution and its Performance Analysis: A Comprehensive Survey," February 2015.

4. J. Yang, J. Wright, T. Huang and Y. Ma, "Image Super-Resolution via Sparse Representation," IEEE Transactions on Image Processing, pp. 2861-2873, 2010.

5. K. He, X. Zhang, S. Ren and J. Sun, "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification," arXiv.org, 2015.

6. A. Greaves and H. Winter, "Multi-Frame Video Super-Resolution Using Convolutional Neural Networks," 2016.

7. J. Kim, J. K. Lee and K. M. Lee, "Accurate Image Super-Resolution Using Very Deep Convolutional Networks," 2016.

Getting Started with Intel® SDK for OpenCL™ Applications (Linux SRB4.1)


This article is a step-by-step guide to quickly get started developing applications using Intel® SDK for OpenCL™ Applications in Linux for SRB4.1. This is now a legacy release. For instructions to install the latest release, please see https://software.intel.com/articles/sdk-for-opencl-gsg.

  1. Install the driver
  2. Install the SDK
  3. Set up Eclipse

 

Step 1: Install the driver

To run applications using OpenCL kernels on the Intel Processor Graphics GPU device with the latest features for the newest processors, you will need a driver package from here: https://software.intel.com/en-us/articles/opencl-drivers.

(If your target processor does not include Intel Processor Graphics, install the latest runtime package instead.)

The install_OCL_driver_ubuntu.sh script covers the steps needed to install the SRB4 driver package on Ubuntu 14.04.

To use it:

$ tar -xvf install_OCL_driver_ubuntu.tgz
$ sudo su
$ ./install_OCL_driver_ubuntu.sh

 

This script automates downloading prerequisites, installing the user-mode components, patching the 4.7 kernel, and building it. 

 

You can check your progress with the System Analyzer Utility. If successful, you should see smoke test results looking like this at the bottom of the system analyzer output:

--------------------------
Component Smoke Tests:
--------------------------
 [ OK ] OpenCL check:platform:Intel(R) OpenCL GPU OK CPU OK

 

 

Step 2: Install the SDK

This script will set up all prerequisites for a successful SDK install on Ubuntu. After this, run the SDK installer.

Here is a kernel to test the SDK install:

__kernel void simpleAdd(
                       __global int *pA,
                       __global int *pB,
                       __global int *pC)
{
    const int id = get_global_id(0);
    pC[id] = pA[id] + pB[id];
}                               

Check that the command line compiler ioc64 is installed with

$ ioc64 -input=simpleAdd.cl -asm

(expected output)
No command specified, using 'build' as default
OpenCL Intel(R) Graphics device was found!
Device name: Intel(R) HD Graphics
Device version: OpenCL 2.0
Device vendor: Intel(R) Corporation
Device profile: FULL_PROFILE
fcl build 1 succeeded.
bcl build succeeded.

simpleAdd info:
	Maximum work-group size: 256
	Compiler work-group size: (0, 0, 0)
	Local memory size: 0
	Preferred multiple of work-group size: 32
	Minimum amount of private memory: 0

Build succeeded!

 

Step 3: Set up Eclipse

Intel SDK for OpenCL applications works with Eclipse Mars and Neon.

After installing, copy the CodeBuilder*.jar file from the SDK eclipse-plug-in folder to the Eclipse dropins folder.

$ cd eclipse/dropins
$ find /opt/intel -name 'CodeBuilder*.jar' -exec cp {} . \;

Start Eclipse.  Code-Builder options should be available in the main menu.

GPU Debugging: Challenges and Opportunities


From GPU Debugging: Challenges and Opportunities presented at the International Workshop on OpenCL (IWOCL) 2017.

GPU debugging support matches the OpenCL™ 2.0 GPU/CPU driver package for Linux* (64-bit) from OpenCL™ Drivers and Runtimes for Intel® Architecture with the notable exception of processors based on the Broadwell architecture. 

Basic concepts to keep in mind for GPU debugging

  • There are "host" and "target" components.  Host = where you interact with the debugger, target = where the application is run 
  • There are 3 components: gdbserver, the application to be debugged, and the gdb session.
  • The gdb session and application can be on the same or different machines
  • Breakpoints in the graphics driver can affect screen rendering.  You cannot debug on the same system that is rendering your screen.
  • However, non-graphical connections (such as with SSH) are unaffected. The "host" can be connected to remotely as well for gdb's text interface.

Abbreviations used:

KMD – Kernel Mode Driver

RT – OpenCL Runtime

DCD – Debug Companion Driver

  • Ring-0 driver, provides low-level gfx access
  • Run control flow, breakpoints, etc

DSL – Debug Support Library

  • Ring-3 debugger driver (shared library)
  • Loaded into the gdbserver process

DSL <--> DCD

  • Communicate via IOCTLs

How to set up a debugging session

The simplest option is to use ssh for steps 1, 2, and 3. However, gdb can be run locally as well. The target steps (1 and 2) should be run remotely because GPU breakpoints can cause rendering hangs.

1. launch gdbserver

/usr/bin/gdbserver-igfx :1234 --attach 123

2. launch the application

export IGFXDBG_OVERRIDE_CLIENT_PID=123
./gemm

Note: there is an automatic breakpoint at the first kernel launch

3. launch GDB

source /opt/intel/opencl-sdk/gt_debugger_2016.0/bin/debuggervars.sh

/opt/intel/opencl-sdk/gt_debugger_2016.0/bin/launch_gdb.sh --tui

In GDB

target remote :1234
continue
x/i $pc

GDB should now be able to step through the kernel code.

OpenCL™ Drivers and Runtimes for Intel® Architecture


What to Download

By downloading a package from this page, you accept the End User License Agreement.

Installation has two parts:

  1. Intel® SDK for OpenCL™ Applications Package
  2. Driver and library(runtime) packages

The SDK includes components to develop applications: IDE integration, offline compiler, debugger, and other tools.  Usually on a development machine the driver/runtime package is also installed for testing.  For deployment you can pick the package that best matches the target environment.

The illustration below shows some example install configurations. 

 

SDK Packages

Please note: A GPU/CPU driver package or CPU-only runtime package is required in addition to the SDK to execute applications

Standalone:

Suite: (also includes driver and Intel® Media SDK)

 

Driver/Runtime Packages Available

GPU/CPU Driver Packages

CPU-only Runtime Packages  

 


Intel® SDK for OpenCL™ Applications 2016 R3 for Linux (64-bit)

This is a standalone release for customers who do not need integration with the Intel® Media Server Studio. It provides components to develop OpenCL applications for Intel processors. 

Visit https://software.intel.com/en-us/intel-opencl to download the version for your platform. For details check out the Release Notes.

Intel® SDK for OpenCL™ Applications 2016 R3 for Windows* (64-bit)

This is a standalone release for customers who do not need integration with the Intel® Media Server Studio. The standard Windows graphics driver packages contains the driver and runtime library components necessary to run OpenCL applications. This package provides components for OpenCL development. 

Visit https://software.intel.com/en-us/intel-opencl to download the version for your platform. For details check out the Release Notes.


OpenCL™ 2.0 GPU/CPU driver package for Linux* (64-bit)

 

The intel-opencl-r5.0 (SRB5.0) Linux driver package enables OpenCL 1.2 or 2.0 on the GPU/CPU for the following Intel® processors:

  • Intel® 5th, 6th or 7th generation Core™ processor
  • Intel® Celeron® Processor J3000 Series with Intel® HD Graphics 500 (J3455, J3355), Intel® Pentium® Processor J4000 Series with Intel® HD Graphics 505 (J4205), Intel® Celeron® Processor N3000 Series with Intel® HD Graphics 500 (N3350, N3450), Intel® Pentium Processor N4000 Series with Intel® HD Graphics 505 (N4200)
  • Intel® Xeon® v4, or Intel® Xeon® v5 Processors with Intel® Graphics Technology (if enabled by OEM in BIOS and motherboard)

Installation Instructions.  Scripts to automate install and additional install documentation available here.

Intel validates the intel-opencl-r5.0 driver on CentOS 7.2 and 7.3 when running the following 64-bit kernels:

  • Linux 4.7 kernel patched for OpenCL
  • Linux 4.4 kernel patched for  Intel® Media Server Studio 2017 R3

Although Intel validates and provides technical support only for the above Linux kernels on CentOS 7.2 and 7.3, other distributions may be adapted by utilizing our generic operating system installation steps as well as MSS 2017 R3 installation steps.  

In addition: Intel also validates Ubuntu 16.04.2 when running the following 64-bit kernel:

  • Ubuntu 16.04.2 default 4.8 kernel

Ubuntu 16.04 with the default kernel works fairly well, but some core features (e.g., device enqueue, SVM memory coherency, VTune support) won’t work without kernel patches. This configuration has been minimally validated to prove that it is viable to suggest for experimental use, but it is not fully supported or certified.

Supported OpenCL devices:

  • Intel® graphics (GPU)
  • CPU

For detailed information please see the driver package Release Notes. 

Previous Linux driver packages:

Intel intel-opencl-r4.1 (SRB4.1) Linux driver package: Installation instructions, Release Notes
Intel intel-opencl-r4.0 (SRB4) Linux driver package: Installation instructions, Release Notes
SRB3.1 Linux driver package: Installation instructions, Release Notes

For Linux drivers covering earlier platforms such as 4th generation Intel Core processor please see the versions of Media Server Studio in the Driver Support Matrix.


OpenCL™ Driver for Iris™ graphics and Intel® HD Graphics for Windows* OS (64-bit and 32-bit)

The standard Intel graphics drivers for Windows* include components needed to run OpenCL* and Intel® Media SDK applications on processors with Intel® Iris™ Graphics or Intel® HD Graphics on Windows* OS.

You can use the Intel Driver Update Utility to automatically detect and update your drivers and software.  Using the latest available graphics driver for your processor is usually recommended.

 

Supported OpenCL devices:

  • Intel graphics (GPU)
  • CPU

For the full list of Intel® Architecture processors with OpenCL support on Intel Graphics under Windows*, refer to the Release Notes.

 


OpenCL™ Runtime for Intel® Core™ and Intel® Xeon® Processors

This runtime software package adds OpenCL CPU device support on systems with Intel Core and Intel Xeon processors.

Supported OpenCL devices:

  • CPU

Latest release (16.1.1)

Previous Runtimes (16.1)

Previous Runtimes (15.1):

For the full list of supported Intel® architecture processors, refer to the OpenCL™ Runtime Release Notes.

 


 Deprecated Releases

Note: These releases are no longer maintained or supported by Intel

OpenCL™ Runtime 14.2 for Intel® CPU and Intel® Xeon Phi™ Coprocessors

This runtime software package adds OpenCL support to Intel Core and Xeon processors and Intel Xeon Phi coprocessors.

Supported OpenCL devices:

  • Intel Xeon Phi coprocessor
  • CPU

Available Runtimes

For the full list of supported Intel architecture processors, refer to the OpenCL™ Runtime Release Notes.

An Example of a Convolutional Neural Network for Image Super-Resolution—Tutorial


This tutorial describes one way to implement a CNN (convolutional neural network) for single-image super-resolution using the Caffe* deep learning framework optimized for Intel® architecture and the Intel® Distribution for Python*, which lets us take advantage of Intel processors and Intel libraries to accelerate training and testing of this CNN.

The CNN we use in this tutorial is the Fast Super-Resolution Convolutional Neural Network (FSRCNN), based on the work described in [1] and [2], whose authors proposed a new approach to performing single-image SR using CNNs. We describe this network and its predecessor (the Super-Resolution Convolutional Neural Network, SRCNN) in more detail in an associated article (“An Example of a Convolutional Neural Network for Image Super-Resolution”).

FSRCNN Structure

As described in the associated article and in [2], the FSRCNN consists of the following operations:

  1. Feature extraction: Extracts a set of feature maps directly from the low-resolution (LR) image.
  2. Shrinking: Reduces dimension of feature vectors (thus decreasing the number of parameters) by using a smaller number of filters (compared to the number of filters used for feature extraction).
  3. Non-linear mapping: Maps feature maps representing LR patches to high-resolution (HR) ones. This step is performed using several mapping layers with a filter size smaller than the one used in SRCNN.
  4. Expanding: Increases dimension of feature vectors. This operation performs the inverse operation as the shrinking layers in order to more accurately produce the HR image.
  5. Deconvolution: Produces the HR image from HR features.

The structure of the FSRCNN (56, 12, 4) model (which is the best performing model reported in [2], and described in the associated article) is shown in Figure 1. It has a LR feature dimension of 56 (number of filters both in the first convolution and in the deconvolution layer), 12 shrinking filters (the number of filters in the layers in the middle of the network, performing the mapping operation), and a mapping depth of 4 (the number of convolutional layers that implement the mapping between the LR and the HR feature space).

Graphic showing structure of FSRCNN
Figure 1: Structure of the FSRCNN (56, 12, 4).

Training and Testing Data Preparation

Datasets to train and test this implementation are available from the authors’ [2]  website. The train dataset consists of 91 images of different sizes. There are two test datasets: Set 5 (containing 5 images) and Set 14 (containing 14 images). In this tutorial, both train and test datasets will be packed into an HDF5* file (https://support.hdfgroup.org/), which can be efficiently used from the Caffe framework. For more information about Caffe optimized for Intel® architecture, visit Manage Deep Learning Networks with Caffe* Optimized for Intel® Architecture and Recipe: Optimized Caffe* for Deep Learning on Intel® Xeon Phi™ Processor x200.

Both train and test datasets need some preprocessing, as follows:

  • Train dataset: First, the images are converted to YCrCb color space (https://en.wikipedia.org/wiki/YCbCr), and only the luminance channel Y is used in this tutorial. Each of the 91 images in the train dataset is downsampled by a factor k, where k is the scaling factor desired for super-resolution, obtaining in this way a pair of corresponding LR and HR images. Next, each image pair (LR/HR) is cropped into a subset of small subimages, using stride s, so we end up with N pairs of LR/HR subimages for each one of the 91 original train images. The reason for cropping the images for training is that we want to train the model using both LR and HR local features located in a small area. The number of subimages, N, depends on the size of the subimages and the stride s. The authors of [2], for their experiments define a 7x7 pixels size for the LR subimages, and a 21x21 pixels size for the HR subimages, which corresponds to a scaling factor k=3.
  • Test dataset: Each image in the test dataset is processed in the same way as the training dataset, with the exception that the stride s can be larger than the one used for training, to accelerate the testing procedure.

The following Python code snippets show one possible way to generate the train and test datasets. We use OpenCV* (http://opencv.org/) to handle and preprocess the images. The first snippet shows how to generate the HR and LR subimage pair set from one of the original images in the 91-image train dataset for the specific case where scaling factor k=3 and stride = 19:

import os
import sys
import numpy as np
import h5py

sys.path.append('$CAFFE_HOME/opencv-2.4.13/release/lib/')
import cv2

# Parameters
scale = 3
stride = 19
size_ground = 19
size_input = 7
size_pad = 2

#Read image to process
image = cv2.imread('<PATH TO FILES>/Train/t1.bmp')

#Change color-space to YCR_CB
image_ycrcb = cv2.cvtColor(image, cv2.COLOR_RGB2YCR_CB)
image_ycrcb = image_ycrcb[:,:,0]
image_ycrcb = image_ycrcb.reshape((image_ycrcb.shape[0], image_ycrcb.shape[1], 1))

#Compute size of LR images and resize HR images to a multiple of scale
height, width = image_ycrcb.shape[:2]
height_small = int(height/scale)
width_small  = int(width/scale)

image_pair_HR = cv2.resize(image_ycrcb, (width_small*scale, height_small*scale) )
image_pair_LR = cv2.resize(image_ycrcb, (width_small, height_small) )

# Declare tensors to hold 1024 LR-HR subimage pairs
input_HR = np.zeros((size_ground, size_ground, 1, 1024))
input_LR = np.zeros((size_input + 2*size_pad, size_input + 2*size_pad, 1, 1024))

height, width = image_pair_HR.shape[:2]

#Iterate over the train image using the specified stride and create LR-HR subimage pairs
count = 0
for i in range(0, height-size_ground+1, stride):
    for j in range(0, width-size_ground+1, stride):
       subimage_HR = image_pair_HR[i:i+size_ground, j:j+size_ground]
       count = count + 1
       height_small = size_input
       width_small  = size_input
       subimage_LR = cv2.resize(subimage_HR, (width_small, height_small) )

       # Zero-pad the LR subimage and store the LR-HR subimage pair
       input_HR[:,:,0,count-1] = subimage_HR
       input_LR[:,:,0,count-1] = np.lib.pad(subimage_LR, ((size_pad, size_pad), (size_pad, size_pad)), 'constant', constant_values=(0.0))

The next snippet shows how to use the python h5py module to create an hdf5 file that contains the HR and LR subimage pair set created in the previous snippet:

(…)
#Create an hdf5 file
with h5py.File('train1.h5','w') as H5:
    H5.create_dataset( 'Input', data=input_LR )
    H5.create_dataset( 'Ground', data=input_HR )
(…)

The previous two snippets can be used to create the hdf5 file containing the entire training set of 91 images to be used for training in Caffe.
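
Caffe’s HDF5Data layer reads its input from a plain-text source file that lists one HDF5 file path per line (the train.txt/test.txt files referenced in the next section). As a minimal sketch, assuming one .h5 file was generated per training image with the naming used above, such a source file could be written like this:

# Write the source file for Caffe's HDF5Data layer: one .h5 path per line.
# File names are examples; adjust them to the files actually generated above.
h5_files = ['examples/FSRCNN/train%d.h5' % i for i in range(1, 92)]
with open('examples/FSRCNN/train.txt', 'w') as f:
    for path in h5_files:
        f.write(path + '\n')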

FSRCNN Training

The reference model (described in the previous section) is implemented using Intel® Distribution for Caffe, which has been optimized to run on Intel CPUs. An introduction to the basics of this framework and directions to install it can be found at the Intel® Nervana AI Academy.

In Caffe, models are defined using protobuf files. The FSRCNN model can be downloaded from the authors’ [2] website. The code snippet below shows the input layer and the first convolutional layer of the FSRCNN (56, 12, 4) model defined by its authors [2]. The input layer reads the train/test data from the files listed in the source files train.txt and test.txt located in the $CAFFE_ROOT/examples/FSRCNN directory. The batch size for training is 128.

name: "SR_test"
layer {
  name: "data"
  type: "HDF5Data"
  top: "data"
  top: "label"
  hdf5_data_param {
    source: "examples/FSRCNN/train.txt"
    batch_size: 128
  }
  include: { phase: TRAIN }
}
layer {
  name: "data"
  type: "HDF5Data"
  top: "data"
  top: "label"
  hdf5_data_param {
    source: "examples/FSRCNN/test.txt"
    batch_size: 2
  }
  include: { phase: TEST }
}

layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 0.1
  }
  convolution_param {
    num_output: 56
    kernel_size: 5
    stride: 1
    pad: 0
    weight_filler {
      type: "gaussian"
      std: 0.0378
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
(...)

To train the above model, the authors of [2] provide in their website a solver protobuf file containing the training parameters and the location of the protobuf network definition file:

# The train/test net protocol buffer definition
net: "examples/FSRCNN/FSRCNN.prototxt"
test_iter: 100
# Carry out testing every 500 training iterations.
test_interval: 5000
# The base learning rate, momentum and the weight decay of the network.
#base_lr: 0.005
base_lr: 0.001
momentum: 0.9
weight_decay: 0
# Learning rate policy
lr_policy: "fixed"
# Display results every 100 iterations
display: 1000
# Maximum number of iterations
max_iter: 1000000
# write intermediate results (snapshots)
snapshot: 5000
snapshot_prefix: "examples/FSRCNN/RESULTS/FSRCNN-56_12_4"
# solver mode: CPU or GPU
solver_mode: CPU

The solver shown above will train the network defined in the model definition file FSRCNN.prototxt using the following parameters:

  • The test interval will be every 5000 iterations, and 100 is the number of forward passes the test should perform.
  • The base learning rate will be 0.001 (the authors’ earlier value of 0.005 is left commented out in the solver file), and the learning rate policy is fixed, which means the learning rate will not change with time. Momentum is 0.9 (a common choice) and weight_decay is zero (no regularization to penalize large weights).
  • Intermediate results (snapshots) will be written to disk every 5000 iterations, and the maximum number of iterations (when the training will stop) is 1000000.
  • Snapshot results will be written to the examples/FSRCNN/RESULTS directory (assuming we run Caffe from the install directory $CAFFE_ROOT). Model files (containing the trained weights) will be prefixed by the string ‘FSRCNN-56_12_4’.

The reader is encouraged to experiment with different parameters. One useful option is to define a small maximum number of iterations, explore how the test error decreases, and compare this rate between different sets of parameters (a simple way to extract the test loss from the training log is sketched after the training command below).

Once the network definition and solver files are ready, start training by running the caffe command located in the build/tools directory:

export CAFFE_ROOT=< Path to caffe >
$CAFFE_ROOT/build/tools/caffe train -engine "MKL2017" -solver \
$CAFFE_ROOT/examples/FSRCNN/FSRCNN_solver.prototxt \
2>$CAFFE_ROOT/examples/FSRCNN/output.log
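
Once training is running, one way to follow how the test error decreases (as suggested above) is to pull the test loss values out of output.log. The short sketch below is not part of the original tutorial; it assumes the default Caffe log format (an "Iteration N, Testing net" line followed by "Test net output #0: <name> = <value>" lines), and the output blob name depends on the loss layer defined in FSRCNN.prototxt:

import re

iteration = None
with open('examples/FSRCNN/output.log') as log:
    for line in log:
        m = re.search(r'Iteration (\d+), Testing net', line)
        if m:
            iteration = int(m.group(1))
        m = re.search(r'Test net output #0: \S+ = ([0-9.eE+\-]+)', line)
        if m and iteration is not None:
            # Print "iteration test_loss" pairs, e.g. for plotting
            print(iteration, float(m.group(1)))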

Resume Training Using Saved Snapshots

After training the CNN, the network parameters (weights) will be written to disk according to the frequency specified by the snapshot parameter. Caffe will create two files at each snapshot:

FSRCNN-56_12_4_iter_1000000.caffemodel
FSRCNN-56_12_4_iter_1000000.solverstate

The model file contains the learned model parameters corresponding to the indicated iteration, serialized as binary protocol buffer files. The solver state file is the state snapshot containing all the necessary information to recover the solver state at the time of the snapshot. This file will let us resume training from the snapshot instead of restarting from scratch. For example, let us assume we ran training for 1 million iterations, and after that we realize that we need to run it for an extra 500K iterations to further reduce the testing error. We can restart the training using the snapshot taken after 1 million iterations:

$CAFFE_ROOT/build/tools/caffe train -engine "MKL2017" -solver \
$CAFFE_ROOT/examples/FSRCNN/FSRCNN_solver.prototxt -snapshot \
$CAFFE_ROOT/examples/FSRCNN/RESULTS/FSRCNN-56_12_4_iter_1000000.solverstate \
2>$CAFFE_ROOT/examples/FSRCNN/output_resume.log

The resumed training will run until the new maximum number of iterations specified in the solver file is reached, which in this case is 1500000.

FSRCNN Testing Using Pre-Trained Parameters

Once we have a trained model, we can use it to perform super-resolution on an input LR image. We can test the network at any moment during the training as long as we have model snapshots already generated.

In practice, we can use the super-resolution model we trained to increase the resolution of any image or video. However, for the purposes of this tutorial, we want to test our trained model on an LR image for which we have an HR image to compare with. To this end, we will use a sample image from the test dataset used in [1] and [2] (from the Set5 dataset, which is also commonly used to test SR models in other publications).

To perform the test, we will use a sample image (butterfly) as the ground truth. To create the input LR image, we will blur and downsample the ground truth image, and will use it to feed the trained network. Once we forward-run the network with the input image, obtaining a super-resolved image as output, we will compare the three images (ground truth, LR, and super-resolved) to visually evaluate the performance of the SR network we trained.

The test procedure described above can be implemented in several ways. As an example, the following Python script implements the testing procedure using the OpenCV library for image handling:

    import os
    import sys
    import numpy as np

    #Set up caffe root directory and add to path
    caffe_root = '$APPS/caffe/'
    sys.path.insert(0, caffe_root + 'python')
    sys.path.append('opencv-2.4.13/release/lib/')

    import cv2
    import caffe

    # Parameters
    scale = 3

    #Create Caffe model using pretrained model
    net = caffe.Net(caffe_root + 'FSRCNN_predict.prototxt',
                      caffe_root + 'examples/FSRCNN/RESULTS/FSRCNN-56_12_4_iter_300000.caffemodel', caffe.TRAIN)

    #Input directories
    input_dir = caffe_root + 'examples/SRCNN/DATA/Set5/'

    #Input ground truth image
    im_raw = cv2.imread(caffe_root + '/examples/SRCNN/DATA/Set5/butterfly.bmp')

    #Change format to YCR_CB
    ycrcb = cv2.cvtColor(im_raw, cv2.COLOR_RGB2YCR_CB)
    im_raw = ycrcb[:,:,0]
    im_raw = im_raw.reshape((im_raw.shape[0], im_raw.shape[1], 1))

    #Blur image and resize to create input for network
    im_blur = cv2.blur(im_raw, (4,4))
    im_small = cv2.resize(im_blur, (int(im_raw.shape[1]/scale), int(im_raw.shape[0]/scale)))  # cv2.resize expects (width, height)

    im_raw = im_raw.reshape((1, 1, im_raw.shape[0], im_raw.shape[1]))
    im_blur = im_blur.reshape((1, 1, im_blur.shape[0], im_blur.shape[1]))
    im_small = im_small.reshape((1, 1, im_small.shape[0], im_small.shape[1]))

    im_comp = im_blur
    im_input = im_small

    #Set mode to run on CPU
    caffe.set_mode_cpu()

    #Copy input image data to net structure
    c1,c2,h,w = im_input.shape
    net.blobs['data'].data[...] = im_input

    #Run forward pass
    out = net.forward()

    #Extract output image from net, change format to int8 and reshape
    mat = out['conv3'][0]
    mat = (mat[0,:,:]).astype('uint8')

    im_raw = im_raw.reshape((im_raw.shape[2], im_raw.shape[3]))
    im_blur = im_blur.reshape((im_blur.shape[2], im_blur.shape[3]))
    im_comp = im_blur.reshape((im_comp.shape[2], im_comp.shape[3]))

    #Display original (ground truth), blurred and restored images
    cv2.imshow("image",im_raw)
    cv2.imshow("image2",im_comp)
    cv2.imshow("image3",mat)
    cv2.waitKey()

    cv2.destroyAllWindows()

Running the above script on the test image displays the output shown in Figure 2. Readers are encouraged to try this network and refine the parameters to obtain better super-resolution results.

 Grayscale samples comparison of butterfly wing after FSRCNN
Figure 2: Testing the trained FSRCNN. The left image is the ground truth. The image in the center is the ground truth after being blurred and downsampled. The image on the right is the super-resolved image using a model snapshot after 300000 iterations.

Summary

In this short tutorial, we have shown how to train and test a CNN for super-resolution. The CNN we described is the Fast Super-Resolution Convolutional Neural Network (FSRCNN) [2], which is described in more detail in an associated article (“An Example of a Convolutional Neural Network for Image Super-Resolution”). This particular CNN was chosen for this tutorial because of its relative simplicity, good performance, and the importance of the authors’ work in the area of CNNs for super-resolution. Several new CNN architectures for super-resolution have been described in the literature recently, and several of them compare their performance to the FSRCNN or its predecessor, created by the same authors: the SRCNN [1].

The training and testing in this tutorial was performed using Intel® Xeon® processors and Intel® Xeon Phi™ processors, using the Intel Distribution for Caffe deep learning framework and Intel Distribution for Python, which are optimized to run on Intel Xeon processors and Intel Xeon Phi processors.

Deep learning-based image/video super-resolution is an exciting development in the field of computer vision. Readers are encouraged to experiment with this network, as well as newer architectures, and test with their own images and videos. To start using Intel’s optimized tools for machine learning and deep learning, visit Intel® Developer Zone (Intel® DZ).

Bibliography

[1] C. Dong, C. C. Loy, K. He and X. Tang, "Learning a Deep Convolutional Network for Image Super-Resolution," 2014.

[2] C. Dong, C. C. Loy and X. Tang, "Accelerating the Super-Resolution Convolutional Neural Network," 2016.

Setting Up Intel® Ethernet Flow Director


Introduction

Intel® Ethernet Flow Director (Intel® Ethernet FD) directs Ethernet packets to the core where the packet consuming process, application, container, or microservice is running. It is a step beyond receive side scaling (RSS) in which packets are sent to different cores for interrupt processing, and then subsequently forwarded to cores on which the consuming process is running.

Intel Ethernet FD supports advanced filters that direct received packets to different queues, and enables tight control on flow in the platform. It matches flows and CPU cores where the processing application is running for flow affinity, and supports multiple parameters for flexible flow classification and load balancing. When operating in Application Targeting Routing (ATR) mode, Intel Ethernet FD is essentially the hardware offloaded version of Receive Flow Steering available on Linux* systems, and when running in this mode, Receive Packet Steering and Receive Flow Steering are disabled.

It provides the most benefit on Linux bare-metal usages (that is, not using virtual machines (VMs)) where packets are small and traffic is heavy. And because the packet processing is offloaded to the network interface card (NIC), Intel Ethernet FD could be used to avert denial-of-service attacks.

Supported Devices

Intel Ethernet FD is supported on devices that use the ixgbe driver, including the following:

  • Intel® Ethernet Converged Network Adapter X520
  • Intel® Ethernet Converged Network Adapter X540
  • Intel® Ethernet Controller 10 Gigabit 82599 family

It is also supported on devices that use the i40e driver:

  • Intel® Ethernet Controller X710 family
  • Intel® Ethernet Controller XL710 family

DPDK includes support for Intel Ethernet FD on the devices listed above. See the DPDK documentation for how to use DPDK and testpmd with Intel Ethernet FD.

In order to determine whether your device supports Intel Ethernet FD, use the ethtool command with the --show-features or -k parameter on the network interface you want to use:

# ethtool --show-features <interface name> | grep ntuple

Screenshot of using the ethtool command to detect Intel Ethernet Flow Director support.

If the ntuple-filters feature is followed by off or on, Intel Ethernet FD is supported on your Ethernet adapter. However, if the ntuple-filters feature is followed by off [fixed], Intel Ethernet FD is not supported on your network interface.

Enabling Intel® Ethernet Flow Director

Driver Parameters for Devices Supported by the ixgbe Driver

On devices that are supported by the ixgbe driver, there are two parameters that can be passed-in when the driver is loaded into the kernel that will affect Intel Ethernet FD:

  • FdirPballoc
  • AtrSampleRate 

FdirPballoc

This driver parameter specifies the packet buffer size allocated to Intel Ethernet FD. The valid range is 1–3, where 1 specifies that 64k should be allocated for the packet buffer, 2 specifies a 128k packet buffer, and 3 specifies a 256k packet buffer. If this parameter is not explicitly passed to the driver when it is loaded into the kernel, the default value is 1 for a 64k packet buffer.

AtrSampleRate

The AtrSampleRate parameter indicates how many Tx packets will be skipped before a sample is taken. The valid range is from 0 to 255. If the parameter is not passed to the driver when it is loaded into the kernel, the default value is 20, meaning that every 20th packet will be sampled to determine if a new flow should be created. Passing a value of 0 will disable ATR mode, and no samples will be taken from the Tx queues.

The above driver parameters are not supported on devices that use the i40e driver.

To enable these parameters, first unload the ixgbe module from the kernel. Note, if you are connecting to the system over ssh, this may disconnect your session:

# rmmod ixgbe

Then re-load the ixgbe driver into the kernel with the desired parameters listed above:

# modprobe ixgbe FdirPballoc=3,2,2,3 AtrSampleRate=31,63,127,255

Note that, in this example, for each parameter there are four values. This is because on my test system, I have two network adapters that are using the ixgbe driver--an Intel Ethernet Controller 10 Gigabit 82599, and an Intel® Ethernet Controller 10 Gigabit X540--each of which has two ports. The order in which the parameters are applied is in PCI Bus/Device/Function order. To determine the PCI BDF order on your system, use the following command:

# lshw -c network -businfo

Screenshot of lshw command showing PCI Bus, Device Function information for NICs

Based on this system configuration, using the modprobe command above, the Intel Ethernet Controller 10 Gigabit X540-AT2 port at PCI address 00:03.0 is allocated the FdirPballoc and AtrSampleRate parameters of 3 and 31, respectively, and the Intel Ethernet Controller 10 Gigabit 82599 port at PCI address 81:00.1 is allocated the FdirPballoc and AtrSampleRate parameters of 3 and 255, respectively.

Once you have determined that your Intel branded server network adapter supports Intel Ethernet FD and you have loaded the desired parameters into the driver (on supported models), execute the following command to enable Intel Ethernet FD:

# ethtool --features enp4s0f0 ntuple on

Screenshot of using ethtool command to turn Intel Flow Director on

Because the commands below only indicate which Rx queue a matched packet should be sent to, ideally an additional step should be taken to pin both Rx queues and the process, application, or container that is consuming the network traffic to the same CPU. Pinning an application/process/container to a CPU is beyond the scope of this document, but it can be done using the taskset command. Pinning IRQs to a CPU can be done using the set_irq_affinity script that is included with the freely available sources of the i40e and ixgbe drivers. See Intel Support: Drivers and Software for the latest versions of these drivers. See also the IRQ Affinity section in this tuning guide for how to set IRQ affinity.

Using Intel Ethernet Flow Director

Intel Ethernet FD can run in one of two modes: externally programmed (EP) mode, and ATR mode. Once Intel Ethernet FD is enabled as shown above, ATR mode is the default mode, provided that the driver is in multiple Tx queue mode. When running in EP mode, the user or management/orchestration software can manually set how flows are handled. In either mode, fields are intelligently selected from the packets in the Rx queues to index into the Perfect-Match filter table. For more information on how Intel Ethernet FD works, see this whitepaper.

Application Targeting Routing

In ATR mode, Intel Ethernet FD uses fields from the outgoing packets in the Tx queues to populate the 8K-entry Perfect-Match filter table. The fields that are selected depend on the packet type; for example, fields to filter TCP traffic will be different than those used to filter user datagram protocol (UDP) traffic. Intel Ethernet FD then uses the Perfect-Match filter table to intelligently route incoming traffic to the Rx queues.

To disable ATR mode and switch to EP mode, simply use the ethtool command shown under Adding Filters to manually add a filter, and the driver will automatically enter EP mode. To automatically re-enable ATR mode, use the ethtool command under Removing Filters until the Perfect-Match filter table is empty.

Externally Programmed Mode

When Intel Ethernet FD runs in EP mode, flows are manually entered by an administrator or by management/orchestration software (for example, OpenFlow*). As mentioned above, once enabled, Intel Ethernet FD automatically enters EP mode when a flow is manually entered using the ethtool command listed under Adding Filters.

Adding Filters

The following commands illustrate how to add flows/filters to Intel Ethernet FD using the -U, -N, or --config-ntuple switch to ethtool.

To specify that all traffic from 10.23.4.6 to 10.23.4.18 be placed in queue 4, issue this command:

# ethtool --config-ntuple <interface name> flow-type tcp4 src-ip 10.23.4.6 dst-ip 10.23.4.18 action 4

Note: Without the ‘loc’ parameter, the rule is placed at position 1 of the Perfect-Match filter table. If a rule is already in that position, it is overwritten.

Forwards to queue 2 all IPv4 TCP traffic from 192.168.10.1:2000 that is going to 192.168.10.2:2001, placing the filter at position 33 of the Perfect-Match filter table (and overwriting any rule currently in that position):

# ethtool --config-ntuple <interface name> flow-type tcp4 src-ip 192.168.10.1 dst-ip 192.168.10.2 src-port 2000 dst-port 2001 action 2 loc 33

Drops all UDP packets from 10.4.82.2:

# ethtool --config-ntuple <interface name> flow-type udp4 src-ip 10.4.82.2 action -1

Note: The VLAN field is not a supported filter with the i40e driver (Intel Ethernet Controller XL710 and Intel Ethernet Controller X710 NICs).

For more information and options, see the ethtool man page documentation on the -U, -N, or --config-ntuple option.

Note: The Intel Ethernet Controller XL710 and the Intel Ethernet Controller X710, of the Intel® Ethernet Adapter family, provide extended cloud filter flow support for more complex cloud networks. For more information on this feature, please see the Cloud Filter Support section in this ReadMe document, or in the ReadMe document in the root folder of the i40e driver sources.

Removing Filters

In EP mode, to remove a filter from the Perfect-Match filter table, execute the following command against the appropriate interface. ‘N’ in the rule below is the numeric location in the table that contains the rule you want to delete:

# ethtool --config-ntuple <interface name> delete N

Listing Filters

To list the filters that have been manually entered in EP mode, execute the following command against the desired interface:

# ethtool --show-ntuple <interface name>

Disabling Intel Ethernet Flow Director

Disabling Intel Ethernet FD is done with this command:

# ethtool --features enp4s0f0 ntuple off

This flushes all entries from the Perfect-Match filter table.

Conclusion

Intel Ethernet FD directs Ethernet packets to the core where the packet consuming process, application, container, or microservice is running. This functionality is a step beyond RSS, in which packets are simply sent to different cores for interrupt processing, and then subsequently forwarded to cores on which the consuming process is running. It can be explicitly programmed by administrators and control plane management software, or it can intelligently sample outgoing traffic and automatically create Perfect-Match filters for incoming packets. When operating in automatic ATR mode, Intel Ethernet FD is essentially the hardware offloaded version of Receive Flow Steering available on Linux systems.

Intel Ethernet FD can provide additional performance benefit, particularly in workloads where packets are small and traffic is heavy (for example, in Telco environments). And because it can be used to filter and drop packets at the network interface card (NIC), it could be used to avert denial-of-service attacks.

 

Resources

https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/82599-10-gbe-controller-datasheet.pdf

https://downloadmirror.intel.com/26556/eng/README.txt

https://downloadmirror.intel.com/26713/eng/Readme.txt

https://downloadmirror.intel.com/22919/eng/README.txt

http://dpdk.org/doc/guides/howto/flow_bifurcation.html

http://www.intel.com/content/dam/www/public/us/en/documents/technology-briefs/xl710-sr-iov-config-guide-gbe-linux-brief.pdf

http://software.intel.com/en-us/videos/creating-virtual-functions-using-sr-iov

Also, view the ReadMe file found in the root directory of both the i40e and ixgbe driver sources.

Notices

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.

This sample source code is released under the Intel Sample Source Code License Agreement.

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

Microsoft, Windows, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries.

*Other names and brands may be claimed as the property of others.

© 2017 Intel Corporation

Cannot Connect to Intel® Flexlm* License Server Due to Firewall


Problem

License check-out on the client system fails with the following error:

 INTEL: Cannot connect to license server system. (-15,570:115 "Operation now in progress")

The user could telnet to the server on port 28518, and port 28518 was open in the firewall.


Root Cause

When the Intel(R) Flexlm license server starts, there are two server daemons running:

One is the FlexNet Publisher* license server daemon, which uses the default port 28518 or the port set on the SERVER line of the license file.

The other is the Intel(R) Software License Manager vendor daemon, which uses the TCP/IP port number specified on the VENDOR line of the license file. Normally this number is omitted. You can find the actual number in the server log. Depending on your operating system, the server log file is located at <install drive>:\program files\common files\intel\flexlm\iflexlmlog.txt on Windows*, or <install location of servers>/lmgrd.log on Linux* or OS X*. You may find lines like the following in the log file:

... (INTEL) (@INTEL-SLOG@) === Network Info ===
... (INTEL) (@INTEL-SLOG@) Socket interface: IPV4
... (INTEL) (@INTEL-SLOG@) Listening port: 49163
... (INTEL) (@INTEL-SLOG@) Daemon select timeout (in seconds):

In this example, listening port 49163 is the TCP/IP port the Intel vendor daemon is using.

Because this second listening port (49163) was not open in the firewall, clients could not connect to the vendor daemon and received the error above during license check-out.


Solution

To connect to the license server with the firewall enabled, you must add exceptions to open the listening ports of both the FlexNet Publisher* license server daemon and the Intel(R) Software License Manager vendor daemon. In this case, opening port 49163 in addition to 28518 resolved the problem.

Intel’s FLEXlm specifies two ports:

1. SERVER host_name host_id port1 -- This port is specified in the product license (typically 28518).

2. VENDOR INTEL port=port2 -- Usually port1 is set to 28518 and port2 is omitted (the system then chooses one randomly).

However, you may set port2 to a fixed value and open that port in the firewall as well.
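
For example (the host name, host ID, and port values below are placeholders; use the values from your own license file), the two lines might look like this, with the vendor daemon pinned to port 49163:

SERVER my_license_server 001122334455 28518
VENDOR INTEL port=49163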


Pixeldash Studios Diversifies to Succeed.


The original article is published by Intel Game Dev on VentureBeat*: Pixeldash Studios Diversifies to Succeed. Get more game dev news and related topics from Intel on VentureBeat.

Screenshot of fast motorcycle running into a car crash, in a winter snowy pass

The state of Louisiana—the Pelican State, apparently—was not really considered a hotbed of game development when Jason Tate and Evan Smith were working for the only studio in Baton Rouge. When that company folded, it became a crossroads moment: move to the Bay Area or any other gaming hotbed, or stay where they were and make the best of the situation in their home state.

“We opted to start our own company so we could make games, and stay in Louisiana,” says Tate, the co-founder and lead programmer, of the decision he made with Smith, who would act as Creative Director on the two-man team.

This decision in 2011 was assisted by the state government’s efforts to build the games industry, which had even seen representatives visit E3 to alert indie companies to the tax programs and other opportunities being offered to attract teams. “We’re part of an incubator called the Louisiana Technology Park where we now have a very cool facility, it’s very affordable, and six years later there are seven indie companies under one roof,” says Tate.

What makes the Pixel Dash story a little different is that despite just being two people, they worked on client projects to supplement funds raised for their game, Road Redemption, through Kickstarter, the Humble Store, and eventually Steam* Early Access. “In the beginning, it was basically two people doing the job of 10 people,” says Smith, “but it was about building something in the community that would have some longevity, so we didn’t do just one project and bet everything on that.”

Having seen and been a part of a team that had bet on one horse and ultimately fizzled, this team was determined to ensure it had legs. “It’s been an interesting journey over five years, bouncing around projects like corporate client work, e-learning, and taking our gaming skills over to training simulations,” says Smith.

Yes, Pixel Dash has crafted apps and other software for diverse topics like student preparedness at Louisiana State University (creating an app that gamified job searches, including elements where you follow a create-your-own-adventure path and could learn at the end where you did well or went wrong. The tool would even advise on outfits so that students would be prepared for the ‘business casual’ world.)

While for a small team this has meant that work on its prime gaming project may have taken some time, it has kept the lights on and the process running. It hasn’t been a one-way street of gamifying corporate apps or programs, however.

“One of the benefits of working with clients is the game design expertise can travel over to the corporate world, but you can learn things from that side, too, that you can then channel back to your game side,” says Tate.

The purpose of all this is to make sure that Road Redemption sees a full release. For an indie game to have started development back in 2011 and hitting Kickstarter* in 2013, that’s a long cycle. But due to this diversified client list, the game will see full release, and that’s the main goal of every developer. “The combination of Kickstarter and Early Access has been the reason we could fund a project of this size. All the funds flow straight back into development,” says Smith.

Screenshot of fast speed motorcycle rooftop chase with guns
Above: Bikes, speed, guns, and rooftop courses give Road Redemption real visual style.

Spirit of Road Rash

It may not surprise you to learn that a game titled Road Redemption is designed in the spirit of the classic road beat-em-up, Road Rash, and inspired by games like Twisted Metal and even Skitchin’, released on the Sega Genesis. Judging from community feedback and continued engagement with the game through the fundraising campaigns, there is a passionate audience for this kind of road combat game.

“We wanted to add new stuff like projectile weapons and guns, and we were surprised how polarizing that was,” says Smith. “Some of the purists said ‘there are no guns in Road Rash, there should be none here’.”

“There’s a camp of people very vocal about staying true to the original and another camp excited about something different,” adds Tate.

Community engagement and interacting with YouTube* influencers and streamers has also kept Road Redemption in the minds of gamers. “It fits well for YouTubers because a lot of ridiculous things happen, so we’ve had PewDiePie cover us twice, and others. We’ve certainly been trying to build relationships…and have over 50 million combined YouTube video views,” says Smith.

It’s a vital marketing outlet for a team without the manpower or budget to follow more traditional advertising methods. Instead, they rely on the forums where gamers discuss their tactics and share reviews as simple as “that was awesome…Santa Claus smacking someone in the face with a shovel.”

Of course, a couple of handy tactics involved appealing to the egos of some prominent influencers. “One thing we did early on was put a lot of the YouTube personalities names in the game, so they were encouraged to go find themselves in the game,” says Smith. Crafty!

Screenshot of high speed motorcycle chase, at night. Biker with a pumpkin on his head attacks other with a baseball bat
Above: Beating rivals with a baseball bat from the back of a bike…don’t try this at home, kids.

“As internet comments go there’s always a lot of hate, so when you see that one person who seems enamored, it really pushes you forward to do well,” says Smith. Following this format also allowed them to balance certain aspects using community feedback and consider requested modes. “One mode was to have to beat the game with just one health. We think that’s going to take ages, and then within a few days on the forums, people are talking about having beaten it,” says Tate. Yeah, they do that, but with the time to make tweaks and follow the myriad comments, the result should be a game that has the longevity the company hopes to achieve.

It’s quite an achievement that may have fallen at the first hurdle if not for the combination of support from the state, the ease of entry through digital distribution, and emerging funding opportunities.

That’s something important for this team in the state of Louisiana. Pixel Dash has taken on interns in paid and course credit programs from LSU, and many of them have turned into full-time employees.

It’s a far cry from the opportunities Smith saw several years ago. “Growing up in Louisiana, it’s not something we thought would be here for us. Personally, it’s like ‘wow, this is a thing now’. We can develop games here…we have dev kits and are working for console, and it would never have happened years ago.”

It’s happening now.

How Embree Delivers Uncompromising Photorealism


Introduction

Rendering is the process of generating final output from a collection of data that defines the geometry of objects, the materials they are made from, and the light sources in a 3D scene.

Rendering is a computationally demanding task, involving calculations for many millions of rays as they travel through the scene and interact with the materials on every surface. With full global illumination (GI), light bounces from surface to surface, changing intensity and color as it goes, and it may also be reflected and refracted, and even absorbed or scattered by the volume of an object rather than just interacting with its surface.

3D rendering of an expensive car

Rendering is used in industrial applications as diverse as architectural and product visualization, animations, visual effects in movies, automotive rendering, and more.

For all these applications, the content creator may be anyone from a lone freelancer to an employee of a multi-million-dollar company with hundreds of employees. The hardware used can be anything from a single machine to many hundreds of machines, any of which may be brand new or a decade old.

No matter the size of the company or what hardware they use, 3D artists are expected to create photoreal output within tight and demanding deadlines.

Corona Renderer*, developed by Render Legion a.s., is a leading rendering solution that meets all these diverse needs.

Logo for Corona

The Challenges

3D rendering of an opulent room

End users require absolute realism in the final output, but they also need unrivaled speed and stability due to the strict deadlines involved in their work.

Handling the complex calculations needed to generate the final output is only half the battle. A modern render engine must also increase productivity through ease of use; in other words, the software must let users work faster as well as render faster. As a result, end users need real-time previews to assess changes in lighting and materials, adjustments in the point of view, or changes to the objects themselves.

Each industry also has its own specific needs, and every user has a set of preferred tools. A render engine must be compatible with as wide a range of tools and plug-ins as possible.

Development costs for the render engine have to be managed, so that the final product is affordable for both single users and large firms.

The render engine must also work on as wide a variety of hardware as possible and scale across setups, from single laptops to render farms consisting of hundreds of multiprocessor machines.

The Solution

3d rendering of a large modern airplane on a sunny runway

Corona Renderer uses Embree ray tracing kernels to carry out the intensive computations necessary in rendering. This ensures that end users get the best speed and performance and the highest level of realism in their final output.

Using Embree ray tracing kernels gives another benefit: the development team is freed from having to optimize these calculations themselves, which means they can use their time and talents to meet other demands such as:

  • Creating a simple and intuitive user interface
  • Ensuring compatibility with a wide range of tools and plug-ins
  • Meeting the specialized needs for each industry
  • Creating code that allows seamless scaling on setups of a single machine through to hundreds of machines
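
For readers curious about what the Embree kernels look like at the code level, the following is a minimal sketch, not Corona Renderer's actual integration: it builds a one-triangle scene and traces a single ray using the Embree 3 C API. Corona Renderer at the time of this article used an earlier Embree release, whose function names differ slightly, and the geometry values and single-ray usage here are illustrative assumptions only.

#include <embree3/rtcore.h>
#include <cstdio>
#include <limits>

int main()
{
        // Create the Embree device and an empty scene.
        RTCDevice device = rtcNewDevice(nullptr);
        RTCScene  scene  = rtcNewScene(device);

        // One triangle: three vertices and one index triple.
        RTCGeometry geom = rtcNewGeometry(device, RTC_GEOMETRY_TYPE_TRIANGLE);
        float *verts = (float *) rtcSetNewGeometryBuffer(
                geom, RTC_BUFFER_TYPE_VERTEX, 0, RTC_FORMAT_FLOAT3,
                3 * sizeof(float), 3);
        unsigned *idx = (unsigned *) rtcSetNewGeometryBuffer(
                geom, RTC_BUFFER_TYPE_INDEX, 0, RTC_FORMAT_UINT3,
                3 * sizeof(unsigned), 1);
        verts[0] = 0.f; verts[1] = 0.f; verts[2] = 0.f;
        verts[3] = 1.f; verts[4] = 0.f; verts[5] = 0.f;
        verts[6] = 0.f; verts[7] = 1.f; verts[8] = 0.f;
        idx[0] = 0; idx[1] = 1; idx[2] = 2;
        rtcCommitGeometry(geom);
        rtcAttachGeometry(scene, geom);
        rtcReleaseGeometry(geom);
        rtcCommitScene(scene);          // builds the acceleration structure

        // Trace one ray from z = -1 straight toward the triangle.
        RTCRayHit rayhit = {};
        rayhit.ray.org_x = 0.2f; rayhit.ray.org_y = 0.2f; rayhit.ray.org_z = -1.f;
        rayhit.ray.dir_x = 0.f;  rayhit.ray.dir_y = 0.f;  rayhit.ray.dir_z = 1.f;
        rayhit.ray.tnear = 0.f;
        rayhit.ray.tfar  = std::numeric_limits<float>::infinity();
        rayhit.ray.mask  = 0xFFFFFFFF;
        rayhit.hit.geomID = RTC_INVALID_GEOMETRY_ID;

        RTCIntersectContext context;
        rtcInitIntersectContext(&context);
        rtcIntersect1(scene, &context, &rayhit);

        if (rayhit.hit.geomID != RTC_INVALID_GEOMETRY_ID)
                printf("hit at t = %f\n", rayhit.ray.tfar);

        rtcReleaseScene(scene);
        rtcReleaseDevice(device);
        return 0;
}

A production renderer traces millions of such rays, typically in packets or streams, but the division of labor is the same: the application describes geometry and rays, and the Embree kernels handle the acceleration structure and intersection work.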

The developers of Corona Renderer use Intel processor-based machines for coding and testing, which helps ensure that the development and testing environments are similar to the hardware used by most end users (90 percent of them work and render on Intel processor-based machines), and gives the developers the most stable, reliable, and best-performing environment for their work.

Close collaboration with the Embree team at Intel means that the Corona Renderer developers get the best results from Intel technology, a benefit that is passed directly on to Corona Renderer end users.

Results – With Embree versus without Embree

Application: Path tracer renderer using Embree ray tracing kernels

Description: Path tracer renderer integrated into various 3D modeling software packages

Highlights: Corona Renderer is a production-quality path tracer renderer, used in industrial applications such as architectural and product visualization, animation for film and television, automotive rendering, and more. It uses Embree ray tracing kernels to accelerate geometry processing and rendering.

End-user benefits: Speed, reliability, and ease of use in creating production-quality images and animations. Interactive rendering provides the same real-time viewing capabilities while working with a scene as those featured in GPU-based renderers, but with none of the drawbacks and limitations of a GPU-based solution.

Comparisons: The same scene was rendered with and without Embree, with results averaged from five machines with different processors (from the fastest rendering times on a dual Intel® Xeon® processor E5-2690 @ 2.9 GHz to the slowest rendering times on an Intel® Core™ i7-3930K processor @ 4.2 GHz).

Charts: rendering results for the same scene with and without Embree.

Results – Comparing Intel® Xeon® Processor Generations

3d rendering of a tank

Application: Path tracer renderer benchmarking application, using Embree ray tracing kernels

Description: Path tracing app for use in benchmarking hardware performance

Highlights: Corona Renderer is a production-quality path tracer renderer, used in industrial applications such as architectural and product visualization, animation for film and television, automotive rendering, and more. It uses Embree ray tracing kernels to accelerate geometry processing and rendering.

End-user benefits: Ability to compare performance of different computer hardware configurations, in particular different brands and generations of CPUs.

The application allows end users to test their own system, while an online database of results allows the user to look up the performance of other configurations.

Comparisons:

graphics comparison
Each generation of Intel® Xeon® processors offers significant improvement in rendering performance, with the Intel® Xeon® processor E5 v4 family processing roughly twice as many rays per second as the Intel® Xeon® processor E5 v2 family.

graphics comparison
Each generation of Intel® Xeon® processors offers significant improvement in rendering time, with the Intel® Xeon® processor E5 v4 family processing being roughly twice as fast as the Intel® Xeon® processor E5 v2 family.

The GPU Question - Speed

There is an ongoing debate about whether GPUs or CPUs provide the best solution for rendering.

The many thousands of cores that a GPU-based rendering solution offers may sound like an advantage, but at best this is only true with relatively simple scenes. As scene complexity increases, the more sophisticated architecture of CPUs takes the lead, providing the best rendering performance.

Corona Renderer uses full GI for true realism—something that is often disabled in GPU renderers for previews and even for final renders. This lack of full GI is behind some of the claims of the speed of GPU renderers. While you can disable this true calculation of light bounces throughout a scene with CPU-based solutions, you don’t really need to, since CPUs don’t struggle with these calculations in the same way GPU solutions do.

3d rendering of a bedroom with textures

GPUs gain their benefits when each of their thousands of cores is performing a similar type of calculation and when “what needs to be calculated next” is well known. However, when handling the millions of rays bouncing through a 3D scene, each core may have to do a very different calculation, and there will be many logic branches that will need to be accounted for. This is where the sophisticated architecture of a CPU pulls ahead, thanks to the more flexible and adaptive scheduling of processes across its cores.

GPU Speed Comparison

Test Setup

An interior scene was created, illuminated only by environment lighting entering the scene through small windows. This is a particularly challenging situation, as most of the lighting in the scene is indirect, coming from light bouncing throughout the scene.

To standardize across the different renderers, only their default materials were used, and the environment lighting was a single color. The render engines were left at their default settings as much as possible, although where relevant the GPU render engines were changed from defaults to use full GI.

The Corona Renderer was set to run for 2 minutes and 30 seconds, which included scene parsing and render time. Since the cost of the single NVIDIA GTX* 1080 card in the test setup is roughly half the cost of the Intel® Core™ i7-6900K processor, the GPU engines were set to run for 5 minutes, approximating the effect of having two GTX 1080 cards to give a comparable measure of performance-per-cost.

Hardware

The same PC was used for each test:
Corona Renderer: Intel Core i7-6900K processor, 3.2 GHz base, 3.7 GHz turbo
GPU engines: NVIDIA GTX 1080, 1.6 GHz base, 1.7 GHz turbo

Results

3d rendering of a bedroom without textures

3d rendering of a bedroom without textures

3d rendering of a bedroom without textures

Despite running for twice as long, both GPU engines showed significant noise in the results and did not approach the results shown by the CPU-based Corona Renderer, which had very little noise remaining in half the time.

By using Corona Renderer’s denoising feature, the same 2 minutes and 30 seconds (the last 5 seconds used for denoising rather than rendering) gives an image that is almost completely free of noise.

3d rendering of a bedroom without textures

Speed Conclusion

At best, the speed benefits of a GPU-based solution only apply to simpler scenes. As the path that the lighting follows increases in complexity, the more sophisticated CPU architecture takes a clear lead.

Many of the claims of GPU rendering speed are based on artificial simplifications of a scene, such as disabling full light bouncing, clamping highlights, and so on. CPU-based solutions can also implement these simplifications, but performance is so good that they are not required.

Other CPU versus GPU Considerations

Stability

CPU-based solutions lead the way in terms of stability and reliability, factors that can be critical in many industries, because they do not rely on the stability of frequently updated graphics card drivers.

Compatibility

Most 3D software has a wide range of plug-ins or shaders available that expand on the inbuilt functionality or materials. CPU rendering solutions offer the widest compatibility with these plug-ins, some of which are integral to the workflow of certain industries.

Also, many companies, and even freelancers, turn to commercial render farms to deliver content within the tight deadlines set by their clients. Render farms use many hundreds of machines to accelerate rendering. While there are many long-established render farms supporting CPU-based solutions like Corona Renderer, far fewer farms exist that support GPU-based renderers.

3D rendering of an elegant courtyard

Interactive Rendering

The ability to see changes in your scene without having to start and stop a render has become critical to the workflow of many artists. This kind of real-time rendering is not unique to GPU-based solutions, however. Since its release, Corona Renderer has included Interactive Rendering that provides exactly this functionality, allowing a user to move an object, change the lighting, alter a material, move a camera, and so on, and see the results of that change immediately.

The result shown in the Interactive Renderer is identical to the final render, including full GI and any post-processing, and can be displayed in a separate window (the Corona VFB) or even in the viewport of the host 3D application as shown below:

3d rendering of a terrace and its 3D environment

Hardware - Networking

Even for freelancers, it is common practice to have a network of machines to use as a local render farm.

Building a multiple-GPU solution can take special hardware and knowledge, and many of the claims of the high performance of GPU-based solutions come from users who have specialized setups that support four or more graphics cards.

With a CPU-based solution, anyone can create a similar network without needing any specialized knowledge or hardware. Any computer from the last decade can be added to the rendering network, thanks to Corona Renderer’s inbuilt Distributed Rendering. If the machines are on the same network, you can use auto-discovery to add them without any manual setup at all, while those on a different network can be added by simply adding their IP addresses to a list. In both cases, those machines can then be used to assist in accelerating rendering beyond what a single machine can do.

This ability is also reflected in the availability of render farms and cloud-based rendering services, which allow users to submit their renders to many hundreds of machines for even faster processing. For CPU-based render engines, users can choose from many farms, while only a handful of farms offer similar services for GPU-based render engines.

This means that CPU-based renderers make it easy to create a farm of rendering machines, even for freelancers or hobby-level users, and of course each machine can still be used as a computer in its own right, while networked GPUs are only useful during rendering.

Hardware - Upgrading

When upgrading a CPU, the benefits are realized across all applications. Upgrading a GPU, on the other hand, only offers benefits for a few select applications and uses. Money invested in GPU hardware almost exclusively benefits your render times, while money invested in an upgraded CPU will benefit every aspect of your workflow.

Hardware - Memory

The maximum RAM directly available on a graphics card is limited by current technology, with the most expensive cards at the time of writing (in the region of USD 2,500) offering a maximum of 32 GB of memory. CPU-based solutions, on the other hand, can easily and affordably have two, three, or four times as much directly accessible RAM as the “latest and greatest” graphics cards, at a fraction of the cost.

While GPU render engines may be able to access out-of-core memory (that is, memory not on the graphics card itself), this capability often results in a reduction of rendering performance. With a CPU-based solution like Corona Renderer, there is much less need to worry about optimizing a scene to reduce polygon counts, textures, and so on.

Also, upgrading a GPU is all-or-nothing: if more memory on the graphics card itself is required, to avoid using out-of-core memory for example, you must replace the entire card. With a CPU-based solution, adding more memory is simple and again benefits every application, not just rendering.

3D rendering of inner gears of a timepiece

GPU versus CPU Summary

Corona Renderer remains CPU-based, ensuring the widest compatibility, greatest stability, easiest hardware setup, and best performance as scenes scale in complexity. Ninety percent of Corona Renderer end users choose Intel processor-based solutions.

Thanks to the power of modern CPUs and the Embree ray tracing kernels, Corona Renderer can offer all the benefits claimed by GPU-based render engines, but with none of the drawbacks.

Conclusion

Corona Renderer continues to grow in market share thanks to its combination of rendering speed and simple user interface that allows users to take an artistic rather than technical approach to their work without any loss of speed, power, or performance.

By ensuring the best possible rendering performance, the Embree ray tracing kernels let the Corona Renderer team look beyond optimizing the rendering process and focus instead on ease of use and continued innovation, introducing unique features such as LightMix, Interactive Rendering, the standalone Corona Image Editor, inbuilt scattering tools, and more.

Despite recent developments in the field of GPUs, CPU-based solutions remain the best, and sometimes only, solution for a great many companies, freelancers, and individuals.

3d rendering of an elegant car in a wet, rainy environment

Learn More

Corona Renderer home page:

https://corona-renderer.com/

Embree ray tracing kernels:

https://embree.github.io/

About Corona Renderer and Render Legion a.s.

Corona Renderer is a CPU-based rendering engine initially developed for 3ds Max*, and also available as a standalone command-line application. It is currently being ported to Cinema 4D*.

It began as a solo student project by Ondřej Karlík at the Czech Technical University in Prague, evolving into a full-time commercial project in 2014 after Ondřej established a company along with former CG artist Adam Hotový and Jaroslav Křivánek, an associate professor and researcher at Charles University in Prague.

It was first released commercially in February 2015, and since then Render Legion has grown to more than 15 members.

Corona Renderer’s focus has been mainly on polishing the ArchViz feature set, and future plans are for further specialization to meet the needs of the automotive, product design, and VFX industries.

What's New? - Intel® VTune™ Amplifier XE 2017 Update 4


Intel® VTune™ Amplifier XE 2017 performance profiler

A performance profiler for serial and parallel performance analysis. Overview | Training | Support.

New for the 2017 Update 4! (Optional update unless you need...)

As compared to 2017 Update 3:

  • General Exploration, Memory Access, HPC Performance Characterization analysis types extended to support Intel® Xeon® Processor Scalable family
  • Support for Microsoft Windows* 10 Creators Update (RS2) 

Resources

  • Learn (“How to” videos, technical articles, documentation, …)
  • Support (forum, knowledgebase articles, how to contact Intel® Premier Support)
  • Release Notes (pre-requisites, software compatibility, installation instructions, and known issues)

Contents

File: vtune_amplifier_xe_2017_update4.tar.gz

Installer for Intel® VTune™ Amplifier XE 2017 for Linux* Update 4

File: VTune_Amplifier_XE_2017_update4_setup.exe

Installer for Intel® VTune™ Amplifier XE 2017 for Windows* Update 4 

File: vtune_amplifier_xe_2017_update4.dmg

Installer for Intel® VTune™ Amplifier XE 2017 - OS X* host only Update 4 

* Other names and brands may be claimed as the property of others.

Microsoft, Windows, Visual Studio, Visual C++, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries.

How Yahoo! JAPAN Used Open vSwitch* with DPDK to Accelerate L7 Performance in Large-Scale Deployment Case Study


View PDF [783 KB]

As cloud architects and developers know, it can be incredibly challenging to keep up with the rapidly increasing cloud infrastructure demands of both users and services. Many cloud providers are looking for proven and effective ways to improve network performance. This case study discusses one such collaborative project undertaken between Yahoo! JAPAN and Intel in which Yahoo! JAPAN implemented Open vSwitch* (OvS) with Data Plane Development Kit (OvS with DPDK) to deliver up to 2x practical cloud application L7 performance improvement while successfully completing a more than 500-node, large-scale deployment.

Introduction to Yahoo! JAPAN

Yahoo! JAPAN is a Japanese Internet company that was originally formed as a joint venture between Yahoo! Inc. and Softbank. The Yahoo! JAPAN portal is one of the most frequently visited websites in Japan, and its many services have been running on OpenStack* Private Cloud since 2012. Yahoo! JAPAN receives over 69 billion monthly page views, of which more than 39 billion come from smartphones alone. Yahoo! JAPAN also has over 380 million total app downloads, and it currently runs more than 100 services.

Network Performance Challenges

As a result of rapid cloud expansion, Yahoo! JAPAN began observing network bottlenecks in its environment in 2015. At that time, both cloud resources and users were doubling year over year, causing a rapid increase in virtual machine (VM) density. Yahoo! JAPAN was also seeing huge spikes and bursts in network traffic whenever breaking news, weather updates, or public service announcements (following an earthquake, for example) occurred. This dynamic put an additional burden on the network environment.

As these network performance challenges arose, Yahoo! JAPAN began experiencing some difficulties meeting service-level agreements (SLAs) for its many services. Engineers from the network infrastructure team at Yahoo! JAPAN noticed that noisy VMs (also known as “noisy neighbors”) were disrupting the network environment.

When that phenomenon occurs, a rogue VM may monopolize bandwidth, disk I/O, CPU, and other resources, which then impacts other VMs and applications in the environment.

Yahoo! JAPAN also noticed that the compute nodes were processing a large volume of short packets and that the network was handling a very heavy load (see Figure 1). Consequently, decreased network performance was affecting the SLAs.

Figure 1. A compute node showing a potential network bottleneck in a virtual switch.

Yahoo! JAPAN determined that its cloud infrastructure required a higher level of network performance in order to meet its application requirements and SLAs. In the course of its research, Yahoo! JAPAN noticed that the Linux* Bridge overrun counter was increasing, which indicated that the cause of its network difficulties lay in the kernel networking path. As a result, the company decided it needed to find a new solution going forward.

About OvS with DPDK

OvS with DPDK is a potential solution to such network performance issues in cloud environments that already use OpenStack Cloud, since OpenStack uses OvS as its virtual switch. Native OvS uses kernel space for packet forwarding, which imposes a performance overhead and can limit network performance. DPDK, however, accelerates packet forwarding by bypassing the kernel.

DPDK integration with OvS offers other beneficial performance enhancements as well. For example, DPDK’s Poll Mode Driver eliminates context switch overhead. DPDK also uses direct user memory access to and from the NIC to eliminate kernel-user memory copy overhead. Both optimizations can greatly boost network performance. Overall, DPDK maintains compatibility with OvS while accelerating packet forwarding performance. Refer to Intel Developer Zone’s article, Open vSwitch with DPDK Overview, for more information.
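
To illustrate the poll-mode model described above, here is a minimal sketch of a DPDK receive loop (an illustration only, not Yahoo! JAPAN's deployment code): after EAL initialization, a core busy-polls the NIC with rte_eth_rx_burst() instead of waiting for interrupts, so packet data stays in user-space hugepage memory and no kernel context switches are involved. Port and queue setup is abbreviated; the use of port 0, queue 0, and an already-configured device are assumptions.

#include <stdint.h>
#include <stdlib.h>
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

int main(int argc, char **argv)
{
        /* Initialize the Environment Abstraction Layer (hugepages, cores, devices). */
        if (rte_eal_init(argc, argv) < 0)
                return EXIT_FAILURE;

        /*
         * Assumption: port 0 has already been configured and started with
         * rte_eth_dev_configure(), rte_eth_rx_queue_setup(), and
         * rte_eth_dev_start(), using a pool from rte_pktmbuf_pool_create().
         */
        const uint16_t port_id = 0;
        struct rte_mbuf *pkts[BURST_SIZE];

        for (;;) {
                /* Poll-mode receive: no interrupts, no kernel involvement. */
                const uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, pkts, BURST_SIZE);

                for (uint16_t i = 0; i < nb_rx; i++) {
                        /* Packet data is already in user-space memory here. */
                        /* ... process or forward pkts[i] ... */
                        rte_pktmbuf_free(pkts[i]);
                }
        }
        return 0;
}

In OvS with DPDK, dedicated PMD threads run this kind of polling loop continuously, which is why CPU affinity and hugepage configuration (discussed below) matter so much for stable performance.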

Collaboration between Intel and Yahoo! JAPAN

As Yahoo! JAPAN was encountering network performance issues, Intel suggested that the company consider OvS with DPDK since it was now possible to use the two technologies in combination with one another. Yahoo! JAPAN was already aware that DPDK offered network performance benefits for a variety of telecommunications use cases but, being a web-based company, the company thought that it would not be able to take advantage of that particular solution. After discussing the project with Intel and learning about ways in which the technologies could work for a cloud service provider, Yahoo! JAPAN decided to try OvS with DPDK in their OpenStack environment.

For optimal performance deployment in OvS with DPDK, Yahoo! JAPAN enabled 1 GB hugepages. This step was important from a performance perspective, because it enabled Yahoo! JAPAN to reduce Translation Lookaside Buffer (TLB) misses and prevent page faults. The company also paid special attention to its CPU affinity design, carefully identifying ideal resource settings for each function. Without that step, Yahoo! JAPAN would not have been able to ensure stable network performance.

OpenStack’s Mitaka release offered the features required for Yahoo! JAPAN’s OvS with DPDK implementation, so the company decided to build a Mitaka cluster running with the configurations mentioned above. The first cluster includes over 150 nodes and uses Open Compute Project (OCP) servers.

Benchmark Test Results

Yahoo! JAPAN achieved impressive performance results after implementing OvS with DPDK in its cloud environment. To demonstrate these gains, the engineers measured two benchmarks: the network layer (L2) and the application layer (L7).

Table 1. Benchmark test configuration.

Hardware
  • CPU: Intel® Xeon® processor E5-2683 v3, 2S
  • Memory: 512 GB DDR4-2400 RDIMM
  • NIC: Intel® Ethernet Converged Network Adapter X520-DA2

Software
  • Host OS: CentOS* 7.2
  • Guest OS: CentOS 7.2
  • OpenStack*: Mitaka
  • QEMU*: 2.6.2
  • Open vSwitch: 2.5.90 + TSO patch (a6be657)
  • Data Plane Development Kit: 16.04

Figure 2. L2 network benchmark test.

L2 Network Benchmark Test Results

In the L2 benchmark test, Yahoo! JAPAN used Ixia IxNetwork* as a packet generator. Upon measuring L2 performance (see Figure 2), Yahoo! JAPAN observed a 10x improvement in network throughput for its short-packet traffic. The company also found that OvS with DPDK reduced latency to as little as one-twentieth (1/20) of its previous level. With these results, Yahoo! JAPAN confirmed that OvS with DPDK accelerates the L2 path to the VM. The results were roughly in line with what Yahoo! JAPAN expected, as telecommunications companies had achieved similar results in their own benchmark tests.

L7 Network Benchmark Test Results

The L7 single-VM benchmark results for the application layer, however, exceeded Yahoo! JAPAN’s expectations. In this test, Yahoo! JAPAN instructed one VM to send a query and another VM to return a response. All applications (HTTP, MQ, DNS, RDB) demonstrated significant performance gains in this scenario (see Figure 3). In the MySQL* sysbench result in particular, Yahoo! JAPAN saw simultaneous improvement in two important metrics: 1.5x better throughput (transactions/sec) and latency (response time) reduced to roughly two-thirds (1/1.5) of its previous level.

Figure 3. Various application benchmark test results.

Application Benchmark Test Results

Why did network performance improve so dramatically? In the case of HTTP, for example, Yahoo! JAPAN saw a 2.0x improvement with OvS with DPDK compared to Linux Bridge. Yahoo! JAPAN determined that this metric improved because OvS with DPDK reduces the number of context switches by 45 percent compared with Linux Bridge.

The benchmark results for RabbitMQ* revealed another promising discovery. When Yahoo! JAPAN ran their first stress test on RabbitMQ under Linux Bridge, it observed degraded performance. When it ran the same stress test under OvS with DPDK, the application environment maintained a much more consistent and satisfactory level of performance (see Figure 4).

Figure 4. RabbitMQ stress test results.

RabbitMQ Stress Test Results

How was this possible? In both tests, noisy conditions created a high degree of context switching. Under Linux Bridge, roughly 50 percent of processing is paid as a “tax” to the kernel; under OvS with DPDK, that tax is only about 10 percent. Because OvS with DPDK suppresses context switching, network performance does not degrade even under challenging real-world conditions. Yahoo! JAPAN also found that CPU pinning reduces interference between multiple noisy neighbor VMs and the critical OvS process, which contributed further to the performance improvements observed in this test.

Ultimately, Yahoo! JAPAN found that OvS with DPDK delivers terrific network performance improvements for cloud environments. This finding was key to resolving Yahoo! JAPAN’s network performance issues and meeting the company’s SLA requirements.

Summary

Despite what you might think, deploying OvS with DPDK is actually not so difficult. Yahoo! JAPAN is already successfully using this technology in a production system with over 500 nodes. OvS with DPDK offers powerful performance benefits and provides a stable network environment, which enables Yahoo! JAPAN to meet its SLAs and easily support the demands placed on its cloud infrastructure. The impressive results that Yahoo! JAPAN has achieved through its implementation of OvS with DPDK can be enjoyed by other cloud service providers too.

When assessing whether OvS with DPDK will meet your requirements, it is important to carefully investigate what is causing the bottlenecks in your cloud environment. Once you fully understand the problem, you can identify which solution will best fit your specific needs.

To accomplish this task, Yahoo! JAPAN performed a thorough analysis of its network traffic before deciding how to proceed. The company learned that there was a high volume of short packets traveling throughout its network. This discovery indicated that OvS with DPDK might be a good solution for its problem, since OvS with DPDK is known to improve performance in network environments where a high volume of short packets is present. For this reason, Yahoo! JAPAN concluded that it is necessary to not only benchmark your results but also have a full understanding of your network’s characteristics in order to find the right solution.

Now that you’ve learned about the performance improvements that Yahoo! JAPAN achieved by implementing OvS with DPDK, have you considered deploying OvS with DPDK within your own cloud? To learn more about enabling OvS with DPDK on OpenStack, read these articles: Using Open vSwitch and DPDK with Neutron in DevStack, Using OpenSwitch with DPDK, and DPDK vHost User Ports.

Acknowledgment

Reflecting on this successful collaboration with Intel, Yusuke Tatsumi, network engineer for Yahoo! JAPAN’s infrastructure team, said: “We found out that the OvS and DPDK combination definitely improves application performance for cloud service providers. It strengthened our cloud architecture and made it more robust.” Yahoo! JAPAN is pleased to have demonstrated that OvS with DPDK is a valuable technology that can achieve impressive network performance results and meet the demanding daily traffic requirements of a leading Japanese Internet company.

About the Author

Rose de Fremery is a New York-based writer and technologist. She is the former Managing Editor of The Social Media Monthly, the world's first print magazine devoted to the social media revolution. Rose currently writes about a range of business IT topics including cloud infrastructure, VoIP, UC, CRM, business innovation, and teleworking.

Notices

Testing conducted on Yahoo! JAPAN systems by Yahoo! JAPAN.

Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer or learn more at intel.com.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.

Intel, the Intel logo, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

© 2017 Intel Corporation.

What is OpenCV?


OpenCV is a software toolkit for processing real-time images and video; it also provides analytics and machine learning capabilities.

Development Benefits

Using OpenCV, a BSD licensed library, developers can access many advanced computer vision algorithms used for image and video processing in 2D and 3D as part of their programs. The algorithms are otherwise only found in high-end image and video processing software.

Powerful Built-In Video Analytics

Video analytics is much simpler to implement with OpenCV APIs for basic building blocks such as background removal, filters, pattern matching, and classification.

Real-time video analytics capabilities include classifying, recognizing, and tracking objects, animals, and people, as well as specific features such as vehicle number plates, animal species, and facial features (faces, eyes, lips, chin, and so on).
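
As a hedged illustration of these capabilities (a minimal sketch, not drawn from any particular product; the camera index, window name, and Esc-to-quit handling are assumptions), the code below opens a camera and runs OpenCV's built-in HOG-based people detector on each frame:

#include <opencv2/opencv.hpp>
#include <vector>

int main()
{
        cv::VideoCapture cap(0);                 // default camera (assumption)
        if (!cap.isOpened())
                return 1;

        // Pre-trained HOG + linear SVM people detector shipped with OpenCV.
        cv::HOGDescriptor hog;
        hog.setSVMDetector(cv::HOGDescriptor::getDefaultPeopleDetector());

        cv::Mat frame;
        while (cap.read(frame)) {
                std::vector<cv::Rect> people;
                hog.detectMultiScale(frame, people);

                // Draw a rectangle around each detection.
                for (const cv::Rect &r : people)
                        cv::rectangle(frame, r, cv::Scalar(0, 255, 0), 2);

                cv::imshow("people", frame);
                if (cv::waitKey(1) == 27)        // Esc to quit
                        break;
        }
        return 0;
}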

Hardware and Software Requirements

OpenCV is written in Optimized C/C++, is cross-platform by design and works on a wide variety of hardware platforms, including Intel Atom® platform, Intel® Core™ processor family, and Intel® Xeon® processor family.

Developers can program OpenCV using C++, C, Python*, and Java* on Operating Systems such as Windows*, many Linux* distros, Mac OS*, iOS* and Android*.

Although some cameras work better than others due to driver quality, OpenCV can use any camera that has a working driver for the operating system in use.

Hardware Optimizations

OpenCV takes advantage of multi-core processing and OpenCL™, so it can also use hardware acceleration when integrated graphics is present.
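
A minimal sketch of what this looks like in practice (the input and output file names are placeholders): reading data into cv::UMat instead of cv::Mat opts a pipeline into OpenCV's transparent API (T-API), which can dispatch supported operations to an OpenCL device such as integrated graphics when one is available and falls back to the CPU otherwise.

#include <opencv2/opencv.hpp>
#include <opencv2/core/ocl.hpp>
#include <iostream>

int main()
{
        // Report whether an OpenCL device is available to OpenCV.
        std::cout << "OpenCL available: "
                  << (cv::ocl::haveOpenCL() ? "yes" : "no") << std::endl;

        // Copying into a UMat opts this pipeline into the transparent API.
        cv::UMat src, gray, blurred, edges;
        cv::imread("input.jpg", cv::IMREAD_COLOR).copyTo(src);

        // These calls may run on the GPU via OpenCL if present, or on the CPU.
        cv::cvtColor(src, gray, cv::COLOR_BGR2GRAY);
        cv::GaussianBlur(gray, blurred, cv::Size(5, 5), 1.5);
        cv::Canny(blurred, edges, 50, 150);

        cv::imwrite("edges.png", edges);
        return 0;
}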

OpenCV v3.2.0 release can use Intel optimized LAPACK/BLAS included in the Intel® Math Kernel Libraries (Intel® MKL) for acceleration. It can also use Intel® Threading Building Blocks (Intel® TBB) and Intel® Integrated Performance Primitives (Intel® IPP) for optimized performance on Intel platforms.

OpenCV uses the FFMPEG library and can use Intel® Quick Sync Video technology to accelerate encoding and decoding using hardware.

OpenCV and IoT

OpenCV has a wide range of applications in traditional computer vision applications such as optical character recognition or medical imaging.

For example, OpenCV can detect bone fractures [1]. OpenCV can also help classify skin lesions and assist in the early detection of skin melanomas [2].

However, OpenCV coupled with the right processor and camera can become a powerful new class of computer vision-enabled IoT sensor. This type of design can scale from simple sensors to multi-camera video analytics arrays. See Designing Scalable IoT Architectures for more information [3].

IoT developers can use OpenCV to build embedded computer vision sensors for detecting IoT application events such as motion detection or people detection.
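
As a sketch of such a sensor (an illustrative example under assumed values, not a reference design; the camera index and the 5,000-pixel threshold are arbitrary), a simple motion detector can be built from OpenCV's background subtraction, treating any frame whose foreground mask exceeds a pixel-count threshold as a motion event for the IoT application to publish:

#include <opencv2/opencv.hpp>
#include <iostream>

int main()
{
        cv::VideoCapture cap(0);                           // camera index is an assumption
        if (!cap.isOpened())
                return 1;

        // Adaptive background model; pixels that differ from it become foreground.
        cv::Ptr<cv::BackgroundSubtractor> bg = cv::createBackgroundSubtractorMOG2();

        cv::Mat frame, fgmask;
        while (cap.read(frame)) {
                bg->apply(frame, fgmask);

                // Remove speckle noise before counting foreground pixels.
                cv::erode(fgmask, fgmask, cv::Mat());
                cv::dilate(fgmask, fgmask, cv::Mat());

                const int moving = cv::countNonZero(fgmask);
                if (moving > 5000) {                       // threshold is an assumption
                        // A real sensor would publish an event here (MQTT, REST, etc.).
                        std::cout << "motion event: " << moving << " changed pixels" << std::endl;
                }
        }
        return 0;
}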

Designers can also use OpenCV to build even more advanced sensor systems such as face recognition, gesture recognition or even sentiment analysis as part of the IoT application flow.

IoT applications can also deploy OpenCV on Fog nodes at the Edge as an analytics platform for a larger number of camera based sensors.

For example, IoT applications use camera sensors with OpenCV for road traffic analysis, Advanced Driver Assistance Systems (ADAS) [3], video surveillance [4], and advanced digital signage with analytics in visual retail applications [5].

OpenCV Integration

Integrating OpenCV with a neural network backend unleashes the true power of computer vision. Using this approach, OpenCV works with convolutional neural networks (CNNs) and deep neural networks (DNNs), allowing developers to build innovative and powerful new vision applications.

To target multiple hardware platforms, these integrations need to be cross-platform by design, but hardware-specific optimization of deep learning algorithms can break that design goal. The OpenVX architecture standard addresses this by proposing resource and execution abstractions.

Hardware vendors can optimize implementations with a strong focus on specific platforms. This allows developers to write code that is portable across multiple vendors and platforms, as well as multiple hardware types.
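
To make the abstraction concrete, here is a hedged sketch using the standard OpenVX C API (an illustration only, not Intel's implementation; the image sizes and choice of filter nodes are arbitrary): the application describes its work as a graph of nodes, verifies the graph once, and then executes it, leaving the vendor's runtime free to schedule the nodes on whatever hardware it targets.

#include <VX/vx.h>
#include <stdio.h>

int main(void)
{
        vx_context context = vxCreateContext();
        vx_graph graph = vxCreateGraph(context);

        // Virtual images exist only inside the graph; the runtime decides
        // where and how to allocate them.
        vx_image input   = vxCreateImage(context, 640, 480, VX_DF_IMAGE_U8);
        vx_image blurred = vxCreateVirtualImage(graph, 640, 480, VX_DF_IMAGE_U8);
        vx_image output  = vxCreateImage(context, 640, 480, VX_DF_IMAGE_U8);

        // Build a two-node pipeline: Gaussian blur followed by a 3x3 median filter.
        vxGaussian3x3Node(graph, input, blurred);
        vxMedian3x3Node(graph, blurred, output);

        // Verification lets the implementation validate and optimize the graph
        // for its target hardware before any execution happens.
        if (vxVerifyGraph(graph) == VX_SUCCESS)
                vxProcessGraph(graph);
        else
                printf("graph verification failed\n");

        vxReleaseImage(&input);
        vxReleaseImage(&blurred);
        vxReleaseImage(&output);
        vxReleaseGraph(&graph);
        vxReleaseContext(&context);
        return 0;
}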

Intel® Computer Vision SDK (Beta) is an integrated design framework and a powerful toolkit for developers to solve complex problems in computer vision. It includes Intel’s implementation of the OpenVX API as well as custom extensions. It supports OpenCL custom kernels and can integrate CNN or DNN.

The pre-built and included OpenCV binary has hooks for Intel® VTune™ Amplifier for profiling vision applications.

Getting Started:

Try this tutorial on basic people recognition.  Also, see OpenCV 3.2.0 Documentation for more tutorials.

Related Software:

Intel® Computer Vision SDK - Accelerated computer vision solutions based on OpenVX standard, integrating OpenCV and deep learning support using the included Deep Learning (DL) Deployment Toolkit.

Intel® Integrated Performance Primitives (IPP) - Programming toolkit for high-quality, production-ready, low-level building blocks for image processing, signal processing, and data processing (data compression/decompression and cryptography) applications.

Intel® Math Kernel Library (MKL) - Library with accelerated math processing routines to increase application performance.

Intel® Media SDK - A cross-platform API for developing media applications using Intel® Quick Sync Video technology.

Intel® SDK for OpenCL™ Applications - Accelerated and optimized application performance with Intel® Graphics Technology compute offload and high-performance media pipelines.

Intel® Distribution for Python* - Specially optimized Python distribution for High-Performance Computing (HPC) with accelerated compute-intensive Python computational packages like NumPy, SciPy, and scikit-learn.

Intel® Quick Sync Video - Leverage dedicated media processing capabilities of Intel® Graphics Technology to decode and encode fast, enabling the processor to complete other tasks and improving system responsiveness.

Intel® Threading Building Blocks (TBB) - Library for shared-memory parallel programming and intra-node distributed memory programming.

References:

  1. Bone fracture detection using OpenCV
  2. Mole Investigator: Detecting Cancerous Skin Moles Through Computer Vision
  3. Designing Scalable IoT Architectures
  4. Advanced Driver Assistance Systems (ADAS)
  5. Smarter Security Camera: A Proof of Concept (PoC) Using the Intel® IoT Gateway
  6. Introduction to Developing and Optimizing Display Technology