
Intel® Parallel Studio XE 2016 Update 4 Readme


Intel® Parallel Studio XE 2016 Update 4 for Linux*, Windows*, and OS X*

Deliver top application performance and reliability with Intel® Parallel Studio XE 2016 Update 4. This software development suite combines Intel's C/C++ and Fortran compilers; performance and parallel libraries; and error-checking, code-robustness, and performance-profiling tools into a single offering.

Key Features

  • Faster code: Boost application performance with code that scales on today's and next-generation processors
  • Create code faster: Utilize a toolset that simplifies creating fast, reliable parallel applications

This package is for users who develop on and build for IA-32 and Intel® 64 architectures on Linux*, Windows*, and OS X*, as well as for customers building for the Intel® Xeon Phi™ coprocessor on Linux*. There are currently three editions of the suite:

  • Intel® Parallel Studio XE 2016 Update 4 Composer Edition, which includes:
    • Intel® C++ Compiler 16.0 Update 4
    • Intel® Fortran Compiler 16.0 Update 4
    • Intel® Data Analytics Acceleration Library (Intel® DAAL) 2016 Update 4
    • Intel® Integrated Performance Primitives (Intel® IPP) 9.0 Update 4
    • Intel® Math Kernel Library (Intel® MKL) 11.3 Update 4
    • Intel® Threading Building Blocks (Intel® TBB) 4.4 Update 6
    • Intel-provided Debug Solutions
  • Intel® Parallel Studio XE 2016 Update 4 Professional Edition adds the following utilities:
    • Intel® VTune™ Amplifier XE 2016 Update 4
    • Intel® Advisor XE 2016 Update 4
    • Intel® Inspector XE 2016 Update 3
  • Intel® Parallel Studio XE 2016 Update 4 Cluster Edition includes all previous tools plus:
    • Intel® MPI Library 5.1 Update 3
    • Intel® Trace Analyzer and Collector 9.1 Update 2
    • Intel® Cluster Checker 3.1 Update 2 (Linux* only)
    • Intel® MPI Benchmarks 4.1 Update 1

New in this release:

  • The following components were updated to their latest versions:
    • Intel® C++ Compiler 16.0 Update 4
    • Intel® Fortran Compiler 16.0 Update 4
    • Intel® Data Analytics Acceleration Library (Intel® DAAL) 2016 Update 4
    • Intel® Integrated Performance Primitives (Intel® IPP) 9.0 Update 4
    • Intel® Math Kernel Library (Intel® MKL) 11.3 Update 4
    • Intel® Threading Building Blocks (Intel® TBB) 4.4 Update 6
    • Intel® VTune™ Amplifier XE 2016 Update 4
  • Bug fixes and documentation updates

For more information on the changes listed above, please read the individual component release notes available from the main Intel® Parallel Studio XE Release Notes page.

Contents:

  • Linux* packages
    • File: parallel_studio_xe_2016_update4.tgz
      Offline installer package; a larger download that contains all components of the product
    • File: parallel_studio_xe_2016_update4_online.sh
      Online installer with a smaller file size; it can save download time by letting you select only the components you want. You must be connected to the internet during installation.
    • File: parallel_studio_xe_2016_composer_edition_update4.tgz
      Offline Installer package for the Intel® Parallel Studio XE Composer Edition for Fortran and C++ Linux* only
    • File: parallel_studio_xe_2016_composer_edition_for_cpp_update4.tgz
      Offline Installer package for the Intel® Parallel Studio XE Composer Edition for C++ Linux* only
    • File: parallel_studio_xe_2016_composer_edition_for_fortran_update4.tgz
      Offline Installer package for the Intel® Parallel Studio XE Composer Edition for Fortran Linux* only
    • File: l_comp_lib_2016.4.258_comp.cpp_redist.tgz
      Redistributable Libraries C++
    • File: l_comp_lib_2016.4.258_comp.for_redist.tgz
      Redistributable Libraries Fortran
    • File: get-ipp-90-crypto-library.htm
      Directions on how to obtain the Cryptography Library
  • Windows* packages
    • File: parallel_studio_xe_2016_update4_setup.exe
      Offline installer package; a larger download that contains all components of the product
    • File: parallel_studio_xe_2016_update4_online_setup.exe
      Online installer with a smaller file size; it can save download time by letting you select only the components you want. You must be connected to the internet during installation.
    • File: parallel_studio_xe_2016_update4_composer_edition_setup.exe
      Offline Installer package for the Intel® Parallel Studio XE Composer Edition for Fortran and C++ Windows* only
    • File: parallel_studio_xe_2016_update4_composer_edition_for_cpp_setup.exe
      Offline Installer package for the Intel® Parallel Studio XE Composer Edition for C++ Windows* only
    • File: parallel_studio_xe_2016_update4_composer_edition_for_fortran_setup.exe
      Offline Installer package for the Intel® Parallel Studio XE Composer Edition for Fortran Windows* only
    • File: ww_icl_redist_msi_2016.4.246.zip
      Redistributable Libraries for 32-bit and 64-bit msi files for the Intel® Parallel Studio XE Composer Edition for C++
    • File: ww_ifort_redist_msi_2016.4.246.zip
      Redistributable Libraries for 32-bit and 64-bit msi files for the Intel® Parallel Studio XE Composer Edition for Fortran
    • File: get-ipp-90-crypto-library.htm
      Directions on how to obtain the Cryptography Library
  • OS X* packages
    • File: m_ccompxe_2016.4.070.dmg
      Offline installer package for the Intel® Parallel Studio XE Composer Edition for C++ OS X*; a larger download that contains all components of the product
    • File: m_ccompxe_online_2016.4.070.dmg
      Online installer with a smaller file size; it can save download time by letting you select only the components you want. You must be connected to the internet during installation.
    • File: m_comp_lib_icc_redist_2016.4.210.dmg
      Redistributable Libraries C++
    • File: m_fcompxe_2016.4.070.dmg
      Offline installer package for the Intel® Parallel Studio XE Composer Edition for Fortran OS X*; a larger download that contains all components of the product
    • File: m_fcompxe_online_2016.4.070.dmg
      Online installer with a smaller file size; it can save download time by letting you select only the components you want. You must be connected to the internet during installation.
    • File: m_comp_lib_ifort_redist_2016.4.210.dmg
      Redistributable Libraries Fortran
    • File: get-ipp-90-crypto-library.htm
      Directions on how to obtain the Cryptography Library

Uninstalling Intel® Compiler 17.0 causes 16.0 IDE integration to fail to install


Problem

I have Intel® Parallel Studio XE Composer Edition 2011 through 2017 installed on this machine.

I uninstalled Intel Compiler 17.0, repaired the 16.0 Update 3 installation, and started Visual Studio, but there was still no IDE integration. I uninstalled 16.0 Update 3 and reinstalled it. Now when I start Visual Studio I get a dialog that states:

"The 'IntelCommonPkg' package did not load correctly."

The ActivityLog contained 3 issues:

C:\ProgramData\Microsoft\VisualStudio\12.0\1033\devenv.CTM was out of date and couldn't be removed.
42 ERROR SetSite failed for package [IntelCommonPkg] {397E715C-BFAE-47AB-8F49-1538DDD77757} 80070002 VisualStudio 2016/09/08 02:56:11.313
43 ERROR End package load [IntelCommonPkg] {397E715C-BFAE-47AB-8F49-1538DDD77757} 80070002 VisualStudio 2016/09/08 02:56:11.314 

Root cause

Some older products, such as Intel® Composer XE 2011, contained an issue that can cause this IDE integration problem. If one of those older products was ever installed on the machine, the issue can appear with any newer version of Intel Parallel Studio XE.

The issue cannot be eliminated in newer versions of Intel® Parallel Studio XE because the fix would be required in the older versions, which have now reached end of life. Nothing can currently be fixed in Intel® Parallel Studio XE 2017 or Intel® Parallel Studio XE 2016, but the workaround below is available.

Workaround

To resolve this situation please do the following steps: 

1. Uninstall the 16.0 Update 3 product.
2. Manually remove the following files and directories if they exist, backing them up to a temporary location first: 
   a) Directory C:\Windows\Microsoft.NET\assembly\GAC_MSIL\Intel.Misc.Utilities 
   b) Directory "C:\Program Files (x86)\Common Files\Intel\shared files"
   c) Directories "C:\Program Files (x86)\Microsoft Visual Studio 12.0\Common7\IDE\Extensions\Intel\C++" 
                       "C:\Program Files (x86)\Microsoft Visual Studio 12.0\Common7\IDE\Extensions\Intel\Common" 
                       "C:\Program Files (x86)\Microsoft Visual Studio 12.0\Common7\IDE\Extensions\Intel\PerformanceGuide" 
   d) Files "C:\Program Files (x86)\Microsoft Visual Studio 12.0\Common7\IDE\PublicAssemblies\IntelCppOptPkg.dll" 
              "C:\Program Files (x86)\Microsoft Visual Studio 12.0\Common7\IDE\PublicAssemblies\IntelLibOptPkg.dll" 
              "C:\Program Files (x86)\Microsoft Visual Studio 12.0\Common7\IDE\PublicAssemblies\IntelPkg.dll"
   e) File "C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\V120\Intel.Build.ICLTasks.v120.dll" 
3. Install the 16.0 Update 3 product. 

The steps above clean the system of all artifacts that caused the problem with Visual Studio IDE integration. 

Developer Success Stories Library


Learn how leading organizations worldwide are using development tools from Intel to boost performance, save development time and costs, and better meet their customers' needs.

Intel® Parallel Studio | Intel® System Studio | Intel® Cluster Studio XE | Intel® Inspector XE | Intel® Integrated Performance Primitives | Intel® MPI Library | Intel® Math Kernel Library | Intel® Threading Building Blocks | Intel® Media Server Studio

Intel® Parallel Studio

 

CADEX Resolves the Challenges of CAD Format Conversion

Parallelism brings CAD Exchanger* software dramatic gains in performance and user satisfaction, plus a competitive advantage.

 

Moscow Institute of Physics and Technology Rockets the Development of Hypersonic Vehicles

Moscow Institute of Physics and Technology creates faster and more accurate computational fluid dynamics software with help from Intel® Math Kernel Library and Intel® C++ Compiler.

 

Pexip Speeds Enterprise-Grade Videoconferencing

Intel® analysis tools enable a 2.5x improvement in video encoding performance for videoconferencing technology company Pexip.

 

Ural Federal University Boosts High-Performance Computing Education and Research

Intel® Developer Tools and online courseware enrich the high-performance computing curriculum at Ural Federal University.

 

Walker Molecular Dynamics Laboratory Optimizes its Molecular Dynamics Software for Advanced HPC Computer Architectures

Intel® Software Development tools increase application performance and productivity for a San Diego-based supercomputer center.

 

Intel® System Studio

 

CID Wireless Shanghai Boosts Long-Term Evolution (LTE) Application Performance

CID Wireless boosts performance for its LTE reference design code by 6x compared to the plain C code implementation.


Intel® Cluster Studio XE

 

Schlumberger Parallelizes Oil and Gas Software with Intel® Software Development Tools

Schlumberger increases performance for its PIPESIM* software up to 10x while streamlining the development process.

 

Intel® Inspector XE

 

CADEX Resolves the Challenges of CAD Format Conversion

Parallelism brings CAD Exchanger* software dramatic gains in performance and user satisfaction, plus a competitive advantage.

 

Intel® Integrated Performance Primitives

 

JD.com Optimizes Image Processing

JD.com speeds image processing 17x, handling 300,000 images in 162 seconds instead of 2,800 seconds, with Intel® C++ Compiler and Intel® Integrated Performance Primitives.

 

Tencent Optimizes an Illegal Image Filtering System

Tencent doubles the speed of its illegal image filtering system using the SIMD instruction set and Intel® Integrated Performance Primitives.

 

Tencent Speeds MD5 Image Identification by 2x

Intel worked with Tencent engineers to optimize the way the company processes millions of images each day, using Intel® Integrated Performance Primitives to achieve a 2x performance improvement.

 

Walker Molecular Dynamics Laboratory Optimizes its Molecular Dynamics Software for Advanced HPC Computer Architectures

Intel® Software Development tools increase application performance and productivity for a San Diego-based supercomputer center.

 

Intel® MPI Library

 

Moscow Institute of Physics and Technology Rockets the Development of Hypersonic Vehicles

Moscow Institute of Physics and Technology creates faster and more accurate computational fluid dynamics software with help from Intel® Math Kernel Library and Intel® C++ Compiler.

 

Walker Molecular Dynamics Laboratory Optimizes its Molecular Dynamics Software for Advanced HPC Computer Architectures

Intel® Software Development tools increase application performance and productivity for a San Diego-based supercomputer center.

 

Intel® Math Kernel Library

 

Qihoo360 Technology Co. Ltd.

Qihoo360 optimizes the speech recognition module of the Euler platform using Intel® Math Kernel Library (Intel® MKL), speeding up performance by 5x.

 

Intel® Threading Building Blocks

 

CADEX Resolves the Challenges of CAD Format Conversion

Parallelism brings CAD Exchanger* software dramatic gains in performance and user satisfaction, plus a competitive advantage.

 

Johns Hopkins University Prepares for a Many-Core Future

Johns Hopkins University increases the performance of its open-source Bowtie 2* application by adding multi-core parallelism.

 

Pexip Speeds Enterprise-Grade Videoconferencing

Intel® analysis tools enable a 2.5x improvement in video encoding performance for videoconferencing technology company Pexip.


Quasardb Streamlines Development for a Real-Time Analytics Database

To deliver first-class performance for its distributed, transactional database, Quasardb uses Intel® Threading Building Blocks (Intel® TBB), Intel's C++ threading library for creating high-performance, scalable parallel applications.

 

Walker Molecular Dynamics Laboratory Optimizes its Molecular Dynamics Software for Advanced HPC Computer Architectures

Intel® Software Development tools increase application performance and productivity for a San Diego-based supercomputer center.

 

Intel® Media Server Studio

 

ActiveVideo Enhances Efficiency

ActiveVideo boosts the scalability and efficiency of its cloud-based virtual set-top box solutions for TV guides, online video, and interactive TV advertising using Intel® Media Server Studio.

 

Kraftway: Video Analytics at the Edge of the Network

Today’s sensing, processing, storage, and connectivity technologies enable the next step in distributed video analytics, where each camera itself is a server. With Kraftway* video software, designed using Intel® Media Server Studio, platforms based on the Intel® Atom™ processor E3800 series can encode up to three 1080p60 streams at different bit rates with close to zero CPU load.

 

iStreamPlanet Transforms Live Video Streaming

By moving video transcoding from traditional blades to the HP Moonshot System, based on Intel® Media Server Studio, iStreamPlanet can encode live video streams in 10x less data center space with up to 5x less power consumption.

 

Vantrix Delivers on Media Transcoding Performance

HP Moonshot* with HP ProLiant* m710p server cartridges and Vantrix Media Platform software, with help from Intel® Media Server Studio, provide a cost-effective solution that delivers more streams per rack unit while consuming less power and space.

Robotics Development Kit R200 Depth-Data Interpretation


Intel® RealSense™ R200 Camera

The Intel RealSense R200 camera is an active stereo camera with a 70-mm stereo baseline.

Figure 1: Intel® RealSense™ R200 camera

Indoors, the Intel RealSense R200 camera uses a class-1 laser device to project additional texture into a scene for better stereo performance. The camera works in disparity space, where disparity is defined as the spatial shift of the same 3D point between the left and right images. The larger the shift in the horizontal plane, the closer the object (depth is inversely proportional to disparity). You can simulate the same effect by holding your thumb at eye level and looking at it with one eye at a time.

Figure 2: Intel® RealSense™ R200 camera

The Intel RealSense R200 camera has a maximum search range of 63 pixels horizontally, which results in a 72-cm minimum depth distance at the nominal 628×468 resolution. At 320×240, the minimum depth distance drops to 32 cm. The laser texture from multiple Intel RealSense R200 devices produces constructive interference, that is, interference between waves of equal frequency and phase that mutually reinforce, producing a single amplitude equal to the sum of the individual amplitudes. As a result, multiple Intel RealSense R200 cameras can be collocated in the same environment. The dual IR cameras are global shutter (every pixel is exposed simultaneously), while the 1080p RGB imager is rolling shutter (each row in a frame is exposed for the same amount of time but begins exposing at a different point in time, allowing overlapping exposures between frames). An internal clock triggers all three image sensors as a group, and the Intel® RealSense™ Cross Platform API provides matched frame sets.

Outdoors, the laser is overpowered by ambient infrared from the sun and has no effect. Furthermore, at default settings, the IR sensors can become oversaturated in a fully sunlit environment, so gain/exposure/frames-per-second tuning might be required. The recommended indoor depth range is around 3.5 m.

Depth Projections

Mapping from 2D pixel coordinates to 3D point coordinates via the rs_intrinsics structure and the rs_deproject_pixel_to_point(...) function requires knowledge of the depth of that pixel in meters. Certain pixel formats exposed by librealsense contain per-pixel depth information and can be used immediately with this function. Other images do not contain per-pixel depth information and thus would typically be projected into, rather than deprojected from (deprojection reverses the projection of depth into the scene). A sketch of the deprojection math appears after the list below.

  • RS_FORMAT_Z16 or rs::format::z16

    • Depth is stored as one unsigned 16-bit integer per pixel mapped linearly to depth in camera-specific units. The distance, in meters, corresponding to one integer increment in depth values can be queried via rs_get_device_depth_scale(...). The following pseudocode shows how to retrieve the depth of a pixel in meters:

      const float scale = rs_get_device_depth_scale(dev, NULL);
      const uint16_t * image = (const uint16_t *)rs_get_frame_data(dev, RS_STREAM_DEPTH, NULL);
      float depth_in_meters = scale * image[pixel_index];
    • If a device fails to determine the depth of a given image pixel, a value of zero will be stored in the depth image. This is a reasonable sentinel for "no depth" because all pixels with a depth of zero would correspond to the same physical location, the location of the imager itself.
    • The default scale (the smallest unit of precision attainable by the device) of an Intel RealSense camera (F200) or Intel RealSense camera (SR300) device is 1/32nd of a millimeter. Allowing for 16 bits (2 to the power 16 = 65536) of units translates to a maximum expressive range of two meters. However, the scale is encoded into the camera's calibration information, potentially allowing for long-range models to use a different scaling factor.
    • The default scale of an Intel RealSense camera (R200) device is one millimeter, allowing for a maximum expressive range of ~65 meters. The depth scale can be modified by calling rs_set_device_option(...) with RS_OPTION_R200_DEPTH_UNITS, which specifies the number of micrometers per one increment of depth. 1000 would indicate millimeter scale, 10000 would indicate centimeter scale, while 31 would roughly approximate the Intel RealSense camera (F200) scale of 1/32nd of a millimeter.
  • RS_FORMAT_DISPARITY16 or rs::format::disparity16

    • Depth is stored as one unsigned 16-bit integer, as a fixed point representation of pixels of disparity. Stereo disparity is related to depth via an inverse linear relationship, and the distance of a point which registers a disparity of 1 can be queried via rs_get_device_depth_scale(...). The following pseudocode shows how to retrieve the depth of a pixel in meters:

      const float scale = rs_get_device_depth_scale(dev, NULL);
      const uint16_t * image = (const uint16_t *)rs_get_frame_data(dev, RS_STREAM_DEPTH, NULL);
      float depth_in_meters = scale / image[pixel_index];
    • Unlike RS_FORMAT_Z16, a disparity value of zero is meaningful. A stereo match with zero disparity will occur for objects "at infinity," objects that are so far away that the parallax between the two imagers is negligible. By contrast, there is a maximum possible disparity. The Intel RealSense camera (R200) only matches up to 63 pixels of disparity in hardware, and even if a software stereo search were run on an image, you would never see a disparity greater than the total width of the stereo image. Therefore, when the device fails to find a stereo match for a given pixel, a value of 0xFFFF will be stored in the depth image as a sentinel.
    • Disparity is currently only available on the Intel RealSense camera (R200), which by default uses a ratio of 32 units in the disparity map to one pixel of disparity. The ratio of disparity units to pixels of disparity can be modified by calling rs_set_device_option(...) with RS_OPTION_R200_DISPARITY_MULTIPLIER. For instance, setting it to 100 would indicate that 100 units in the disparity map are equivalent to one pixel of disparity.
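
To make the deprojection concrete, the following Python sketch applies the standard pinhole model that rs_deproject_pixel_to_point(...) performs when no lens distortion is in play. The intrinsics values (fx, fy, ppx, ppy) and the depth scale are illustrative placeholders, not real calibration data; in real code they come from the camera via rs_intrinsics and rs_get_device_depth_scale(...).

# Hedged sketch: pinhole deprojection of a Z16 depth pixel to a 3D point.
# fx, fy, ppx, ppy and depth_scale are illustrative placeholders; use the
# values reported by the camera's calibration in real code.
def deproject_pixel_to_point(u, v, depth_value, depth_scale=0.001,
                             fx=600.0, fy=600.0, ppx=314.0, ppy=234.0):
    """Return (X, Y, Z) in meters for pixel (u, v), ignoring lens distortion."""
    z = depth_value * depth_scale       # R200 default: 1 mm per depth unit
    if z == 0.0:                        # zero is the "no depth" sentinel
        return None
    x = (u - ppx) / fx * z
    y = (v - ppy) / fy * z
    return (x, y, z)

# Example: pixel (320, 240) holding raw value 1500 maps to a point 1.5 m away.
print(deproject_pixel_to_point(320, 240, 1500))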

Depth Calculation

Since the optical axes of the two imagers are parallel and their focal lengths are the same, the Intel RealSense camera (R200) internal circuitry determines the disparity d from the stereo baseline (B) and the focal length (f). Xl and Xr are the horizontal positions at which the left and right cameras see the same point, and (Xl,Yl) and (Xr,Yr) are the corresponding image points. Imagine the Y axis is perpendicular to the image, pointing toward you. Disparity is (Xl - Xr).

Using the concept of triangulation, we can now obtain depth:

Depth (Z) = (baseline * focal length)/disparity

Figure 3: Intel® RealSense™ R200 camera
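
As a sanity check of this formula against the numbers quoted earlier, here is a quick back-of-the-envelope calculation; the focal length used is an assumed, illustrative value rather than a published calibration.

# Hedged sanity check of Depth (Z) = (baseline * focal length) / disparity.
baseline_m = 0.070        # 70-mm stereo baseline
focal_px = 650.0          # assumed focal length in pixels; illustrative only
max_disparity_px = 63.0   # maximum hardware search range

min_depth_m = baseline_m * focal_px / max_disparity_px
print("%.2f m" % min_depth_m)   # ~0.72 m, close to the quoted minimum depth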

Conclusion

The Intel RealSense R200 depth camera provides depth data as a 16-bit number per pixel that can easily be converted to canonical distances measured in meters. This makes it possible to extract scene information with algorithms that go beyond what RGB data alone provides. It is thus possible to combine the RGB pixels and depth pixels to produce point clouds that represent a 3D sampling of the scene the camera is looking at.

Intel® IPP 2017 Bug Fixes


NOTE: Defects and feature requests described below represent specific issues with specific test cases. It is difficult to succinctly describe an issue and how it impacted the specific test case. Some of the issues listed may impact multiple architectures, operating systems, and/or languages. If you have any questions about the issues discussed in this report, please post on the user forums, http://software.intel.com/en-us/forums or submit an issue to Intel® Premier Support, https://premier.intel.com.

Intel® IPP 2017 (6 Sep 2016)

DPD200581260: ippsRegExpFind_8u generating stack overflow
DPD200583297: ippiCopy_32f_C1R producing wrong results in AVX-optimized code. Workaround: use ippsCopy_32f on each image line

Smarter Security Camera: a POC using the Intel® IoT Gateway


Intro

The Internet of Things is enabling our lives in new and interesting ways, but with it come the challenges of analyzing and bringing meaning to all the continuously generated data. One IoT trend in the home is the rise of security cameras, not just one or two, but multiple ones around the house and in each room to monitor the status. This creates massive amounts of data when images or movie files are saved. Taking one house as an example: its 12 cameras generate around 5 GB per day, taking over 180,000 images in total. That is far too much data to look through manually. Some cameras have built-in motion sensors that capture images only when a change is detected, and while this helps reduce the noise, light changes, pets, fans, and things moving in the wind will still be picked up and have to be sorted through. OpenCV presents a promising solution for monitoring only what is wanted; for the purposes of this paper, that means people and faces. OpenCV already has a number of pre-defined algorithms to search images for faces, people, and objects, and it can also be trained to recognize new ones.

This article is a proof of concept to explore quickly prototyping an analytics solution at the edge using the Intel IoT Gateway computing power to create a Smarter Security Camera.

Figure 1: Analyzed image from webcam with OpenCV detection markers

Set-up

It all starts with a Logitech C270 webcam with HD 720p resolution and a 2.4 GHz Intel Core 2 Duo. This webcam plugs into the USB port of the Intel Edison, which turns it into an IP webcam streaming the video to a website. Using the webcam with the Intel Edison makes it easy to duplicate the camera "sensor" across different locations around a home. The Intel IoT Gateway then captures images from the stream and uses OpenCV to analyze them. If the algorithms detect a face or a person in view, the gateway uploads the image to Twitter*.

Figure 2: Intel Edison and Webcam setup

Figure 3: Intel® IoT Gateway

Capturing the image

The webcam must be UVC compliant to ensure that it is compatible with the Intel Edison's USB drivers; in this case the Logitech C270 webcam is used. For a list of UVC-compliant devices, see http://www.ideasonboard.org/uvc/#devices. To use the USB slot, the Intel Edison's micro switch must be toggled up toward the USB slot. Note that this disables the micro-USB port next to it, and with it Ethernet, power (the external power supply must now be plugged in instead of using the micro-USB slot as a power source), and Arduino sketch uploads. Also connect the Intel Edison to the gateway's Wi-Fi hotspot so the gateway can reach the webcam.

To ensure the USB webcam is working, type the following into a serial connection.

ls -l /dev/video0

A line similar to this one should appear:

crw-rw---- 1 root video 81, 0 May  6 22:36 /dev/video0

Otherwise, this line will appear, indicating the camera was not found:

ls: cannot access /dev/video0: No such file or directory

In the early stages of the project, the Intel Edison used the FFmpeg library to capture an image and then send it over MQTT to the gateway. This method had some drawbacks: each image took a few seconds just to be saved, which was far too slow for practical use. To combat this and make images available to the gateway on demand, the setup was switched so that the Intel Edison continuously streams a feed that the gateway can capture from at any time. This was done with the mjpg-streamer library; to install it on the Intel Edison, do the following.

Add the following lines to base-feeds.conf:

echo "src/gz all http://repo.opkg.net/edison/repo/all
src/gz edison http://repo.opkg.net/edison/repo/edison
src/gz core2-32 http://repo.opkg.net/edison/repo/core2-32">> /etc/opkg/base-feeds.conf

Update the repository index:

opkg update

And install:

opkg install mjpg-streamer

To start the stream:

mjpg_streamer -i "input_uvc.so -n -f 30 -r 800x600" -o "output_http.so -p 8080 -w ./www"

The MJPG compressed format was chosen to keep the frame rate high. However, the YUV format is uncompressed, which preserves more detail for OpenCV, so experiment with the tradeoffs.

To view the stream while on the same Wi-Fi network, visit http://localhost:8080/?action=stream; a still image of the feed can also be viewed at http://localhost:8080/?action=snapshot, where localhost is the IP address of the Intel Edison connected to the gateway's Wi-Fi. On the Intel Gateway side, the flow sends an HTTP request to the snapshot URL and then saves the image to disk, as sketched below.
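
A minimal sketch of that gateway-side capture step might look like the following, assuming Python 2.7 (as used for the OpenCV script later in this article) and a hypothetical Edison address of 192.168.1.100:

# Hedged sketch: fetch a snapshot from the Edison's stream and save it where
# the OpenCV script expects it. The IP address below is a placeholder.
import urllib2

SNAPSHOT_URL = "http://192.168.1.100:8080/?action=snapshot"

response = urllib2.urlopen(SNAPSHOT_URL, timeout=5)
with open('/root/incoming.jpg', 'wb') as f:
    f.write(response.read())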

 

Gateway

The brains of the whole security camera are on the gateway. OpenCV was installed into a virtual Python environment to create a clean, segmented environment that does not interfere with the system Python and its packages. Basic install instructions for OpenCV on Linux can be found here: http://docs.opencv.org/2.4/doc/tutorials/introduction/linux_install/linux_install.html. These instructions need to be modified in order to install OpenCV and its dependencies on the Intel Wind River Gateway.

GCC, Git, and python2.7-dev are already installed.

Install CMake 2.6 or higher:

wget http://www.cmake.org/files/v3.2/cmake-3.2.2.tar.gz
tar xf cmake-3.2.2.tar.gz
cd cmake-3.2.2
./configure
make
make install

As the Wind River Linux environment has no apt-get command, it can quickly become a challenge to install the needed development packages. An easy way around this is to first install them on another 64-bit Linux machine (running Ubuntu in this case) and then manually copy the files to the gateway. Full file lists can be found on the Ubuntu site: http://packages.ubuntu.com/. For example, for the libtiff4-dev package, files in /usr/include/<file> should go to the same location on the gateway, and files in /usr/lib/x86_64-linux-gnu/<file> should go into /usr/lib/<file>. The full list of files for that package is here: http://packages.ubuntu.com/precise/amd64/libtiff4-dev/filelist. Install and copy the files over for the packages listed below.

sudo apt-get install  libgtk2.0-dev pkg-config libavcodec-dev libavformat-dev libswscale-dev
sudo apt-get install libjpeg8-dev libpng12-dev libtiff4-dev libjasper-dev  libv4l-dev

Install pip; it will help install a number of other dependencies.

wget https://bootstrap.pypa.io/get-pip.py
python get-pip.py

Install virtualenv; this will create a separate environment for OpenCV.

pip install virtualenv virtualenvwrapper

Once virtualenv has been installed, create an environment called "cv":

export WORKON_HOME=$HOME/.virtualenvs
mkvirtualenv cv

Note that all the following steps are done while the "cv" environment is activated. Once "cv" has been created, it is activated automatically in the current session; this can be seen at the beginning of the command prompt, e.g. (cv) root@WR-IDP-NAME. For future sessions it can be activated with the following command:

. ~/.virtualenvs/cv/bin/activate

And similarly deactivated (do not deactivate it yet):

deactivate

Install numpy:

pip install numpy

Get the OpenCV Source Code:

cd ~
git clone https://github.com/Itseez/opencv.git
cd opencv
git checkout 3.0.0

And make it:

mkdir build
cd build
cmake -D CMAKE_BUILD_TYPE=RELEASE \
-D CMAKE_INSTALL_PREFIX=/usr/local \
-D INSTALL_C_EXAMPLES=ON \
-D INSTALL_PYTHON_EXAMPLES=ON \
-D OPENCV_EXTRA_MODULES_PATH=~/opencv_contrib/modules \
-D BUILD_EXAMPLES=ON \
-D PYTHON_INCLUDE_DIR=/usr/include/python2.7/ \
-D PYTHON_INCLUDE_DIR2=/usr/include/python2.7 \
-D PYTHON_LIBRARY=/usr/lib64/libpython2.7.so \
-D PYTHON_PACKAGES_PATH=/usr/lib64/python2.7/site-packages/ \
-D BUILD_NEW_PYTHON_SUPPORT=ON \
-D PYTHON2_LIBRARY=/usr/lib64/libpython2.7.so \
-D BUILD_opencv_python3=OFF \
-D BUILD_opencv_python2=ON ..

It may be the case that the cv2.so file is not created. If so, build OpenCV on the host Linux machine as well and copy the resulting cv2.so file to /usr/lib64/python2.7/site-packages on the gateway.
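
To confirm the bindings are importable inside the "cv" environment, a quick smoke test such as the following can help (run it with the environment activated; the version printed should match the 3.0.0 checkout above):

# Quick smoke test: import OpenCV inside the "cv" virtualenv and print its version.
import cv2
print(cv2.__version__)   # expect "3.0.0" for the checkout used here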

Figure 4: Webcam capture of people outside with OpenCV detection markers

To quickly create a program and connect a large number of capabilities and services together, as in this project, Node-RED* was used. Node-RED is a rapid prototyping tool that lets the user visually wire together hardware devices, APIs, and various services. It also comes pre-installed on the gateway; just make sure to update it to the latest version.

Figure 5: Node-RED Flow

Once a message is injected at the "Start" node, the script loops continuously after processing an image or encountering an error. A few nodes of note in the setup are the http request node, the python script (exec) node, and the function node that builds the tweet. The "Repeat" node visually simplifies the repeat flow into one node instead of pointing all three flows back to the beginning.

The "http request" node sends a GET message to the Intel Edison IP webcam's snapshot URL. If it succeeds, the flow saves the image; otherwise it tweets an error message about the webcam.

Figure 6: Node-RED http GET request node details

To run the python script, create an “exec” node (it will be in the advanced section in Node-RED) with the command “/root/.virtualenvs/cv/bin/python2.7 /root/PeopleDetection.py”. This allows the script to run in the virtual python environment where OpenCV is installed.

Figure 7: Node-RED exec node details

The python script itself is fairly simple: it checks the image for people using the HOG algorithm and then looks for faces using the Haar cascade frontal-face-alt classifier that ships with OpenCV. It also saves an image with boxes drawn around the people and faces found. The code below is by no means optimized; for this proof of concept we only tweaked some of the algorithm inputs to suit our purposes. There is the option of scaling the image down before analyzing it to reduce processing time; see the code snippet below for how to do that. It takes the gateway roughly 0.33 seconds to process an image. For comparison, an Intel Edison takes around 10 seconds to process the same image. Depending on where the camera is located and how far away people are expected to be, the OpenCV algorithm parameters may need to change to better fit the situation.

import numpy as np
import cv2
import sys
import datetime

# Draw green boxes around detected people and blue boxes around detected faces.
def draw_detections(img, rects, rects2, thickness = 2):
  for x, y, w, h in rects:
    # The HOG detector returns rectangles slightly larger than the person,
    # so shrink them a bit for a tighter box.
    pad_w, pad_h = int(0.15*w), int(0.05*h)
    cv2.rectangle(img, (x+pad_w, y+pad_h), (x+w-pad_w, y+h-pad_h), (0, 255, 0), thickness)
    print("Person Detected")
  for (x, y, w, h) in rects2:
    cv2.rectangle(img, (x, y), (x+w, y+h), (255, 0, 0), thickness)
    print("Face Detected")

total = datetime.datetime.now()

img = cv2.imread('/root/incoming.jpg')
# Optional resize of the image to make processing faster
#img = cv2.resize(img, (0,0), fx=0.5, fy=0.5)

# People detection: HOG descriptor with the default people detector.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())
peopleFound, weights = hog.detectMultiScale(img, winStride=(8,8), padding=(16,16), scale=1.3)

# Face detection: Haar cascade classifier shipped with OpenCV.
faceCascade = cv2.CascadeClassifier('/root/haarcascade_frontalface_alt.xml')
facesFound = faceCascade.detectMultiScale(img, scaleFactor=1.1, minNeighbors=5, minSize=(30,30), flags = cv2.CASCADE_SCALE_IMAGE)

draw_detections(img, peopleFound, facesFound)

cv2.imwrite('/root/out_faceandpeople.jpg', img)

print("[INFO] total took: {}s".format(
  (datetime.datetime.now() - total).total_seconds()))

To send an image to Twitter, the tweet is constructed in a function node, using msg.media as the image variable and msg.payload as the tweet string.

Figure 8: Node-RED function message node details

And of course, the system needs to be able to take pictures on demand as well. Node-RED monitors the same Twitter feed for posts that contain "spy" or "Spy" and posts a current picture to Twitter in response. So posting a tweet with the word "spy" in it will trigger the gateway to take a picture.

Figure 9: Node-RED flow for taking pictures on demand

Summary

This concludes the proof of concept for a smarter security camera computing at the edge on the gateway. The Wind River Linux gateway comes with a number of tools pre-installed, ready for quick prototyping. From here the project can be further optimized, made more robust with security features, and even expanded, for example to turn on smart lighting in rooms when a person is detected.

 

About the author

Whitney Foster is a software engineer at Intel in the Software Solutions Group working on scale enabling projects for Internet of Things.

 

Notices

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.

Intel, the Intel logo, and Intel RealSense are trademarks of Intel Corporation in the U.S. and/or other countries.

 

*Other names and brands may be claimed as the property of others

© 2016 Intel Corporation.

 

Intel® XDK FAQs - General


How can I get started with Intel XDK?

There are plenty of videos and articles that you can go through here to get started. You could also start with some of our demo apps. It may also help to read Five Useful Tips on Getting Started Building Cordova Mobile Apps with the Intel XDK, which will help you understand some of the differences between developing for a traditional server-based environment and developing for the Intel XDK hybrid Cordova app environment.

Having prior understanding of how to program using HTML, CSS and JavaScript* is crucial to using the Intel XDK. The Intel XDK is primarily a tool for visualizing, debugging and building an app package for distribution.

You can do the following to access our demo apps:

  • Select Project tab
  • Select "Start a New Project"
  • Select "Samples and Demos"
  • Create a new project from a demo

If you have specific questions following that, please post it to our forums.

How do I convert my web app or web site into a mobile app?

The Intel XDK creates Cordova mobile apps (aka PhoneGap apps). Cordova web apps are driven by HTML5 code (HTML, CSS and JavaScript). There is no web server on the mobile device to "serve" the HTML pages in your Cordova web app; the main program resources required by your Cordova web app are file-based, meaning all of your web app resources are located within the mobile app package and reside on the mobile device. Your app may also require resources from a server. In that case, you will need to connect with that server using AJAX or similar techniques, usually via a collection of RESTful APIs provided by that server. However, your app is not integrated into that server; the two entities are independent and separate.

Many web developers believe they should be able to include PHP or Java code or other "server-based" code as an integral part of their Cordova app, just as they do in a "dynamic web app." This technique does not work in a Cordova web app, because your app does not reside on a server, there is no "backend"; your Cordova web app is a "front-end" HTML5 web app that runs independent of any servers. See the following articles for more information on how to move from writing "multi-page dynamic web apps" to "single-page Cordova web apps":

Can I use an external editor for development in Intel® XDK?

Yes, you can open your files and edit them in your favorite editor. However, note that you must use Brackets* to use the "Live Layout Editing" feature. Also, if you are using App Designer (the UI layout tool in Intel XDK) it will make many automatic changes to your index.html file, so it is best not to edit that file externally at the same time you have App Designer open.

Some popular editors among our users include:

  • Sublime Text* (Refer to this article for information on the Intel XDK plugin for Sublime Text*)
  • Notepad++* for a lightweight editor
  • Jetbrains* editors (Webstorm*)
  • Vim* the editor

How do I get code refactoring capability in Brackets* (the Intel XDK code editor)?

...to be written...

Why doesn’t my app show up in Google* play for tablets?

...to be written...

What is the global-settings.xdk file and how do I locate it?

global-settings.xdk contains information about all your projects in the Intel XDK, along with many of the settings related to panels under each tab (Emulate, Debug etc). For example, you can set the emulator to auto-refresh or no-auto-refresh. Modify this file at your own risk and always keep a backup of the original!

You can locate global-settings.xdk here:

  • Mac OS X*
    ~/Library/Application Support/XDK/global-settings.xdk
  • Microsoft Windows*
    %LocalAppData%\XDK
  • Linux*
    ~/.config/XDK/global-settings.xdk

If you are having trouble locating this file, you can search for it on your system using something like the following:

  • Windows:
    > cd /
    > dir /s global-settings.xdk
  • Mac and Linux:
    $ sudo find / -name global-settings.xdk

When do I use the intelxdk.js, xhr.js and cordova.js libraries?

The intelxdk.js and xhr.js libraries were only required for use with the Intel XDK legacy build tiles (which have been retired). The cordova.js library is needed for all Cordova builds. When building with the Cordova tiles, any references to intelxdk.js and xhr.js libraries in your index.html file are ignored.

How do I get my Android (and Crosswalk) keystore file?

New with release 3088 of the Intel XDK, you may now download your build certificates (aka keystore) using the new certificate manager that is built into the Intel XDK. Please read the initial paragraphs of Managing Certificates for your Intel XDK Account and the section titled "Convert a Legacy Android Certificate" in that document, for details regarding how to do this.

It may also help to review this short, quick overview video (there is no audio) that shows how you convert your existing "legacy" certificates to the "new" format that allows you to directly manage your certificates using the certificate management tool that is built into the Intel XDK. This conversion process is done only once.

If the above fails, please send an email to html5tools@intel.com requesting help. It is important that you send that email from the email address associated with your Intel XDK account.

How do I rename my project that is a duplicate of an existing project?

See this FAQ: How do I make a copy of an existing Intel XDK project?

How do I recover when the Intel XDK hangs or won't start?

  • If you are running Intel XDK on Windows* it must be Windows* 7 or higher. It will not run reliably on earlier versions.
  • Delete the "project-name.xdk" file from the project directory that Intel XDK is trying to open when it starts (it will try to open the project that was open during your last session), then try starting Intel XDK. You will have to "import" your project into Intel XDK again. Importing merely creates the "project-name.xdk" file in your project directory and adds that project to the "global-settings.xdk" file.
  • Rename the project directory Intel XDK is trying to open when it starts. Create a new project based on one of the demo apps. Test Intel XDK using that demo app. If everything works, restart Intel XDK and try it again. If it still works, rename your problem project folder back to its original name and open Intel XDK again (it should now open the sample project you previously opened). You may have to re-select your problem project (Intel XDK should have forgotten that project during the previous session).
  • Clear Intel XDK's program cache directories and files.

    On a Windows machine this can be done using the following on a standard command prompt (administrator is not required):

    > cd %AppData%\..\Local\XDK
    > del *.* /s/q

    To locate the "XDK cache" directory on OS X* and Linux* systems, do the following:

    $ sudo find / -name global-settings.xdk
    $ cd <dir found above>
    $ sudo rm -rf *

    You might want to save a copy of the "global-settings.xdk" file before you delete that cache directory and copy it back before you restart Intel XDK. Doing so will save you the effort of rebuilding your list of projects. Please refer to this question for information on how to locate the global-settings.xdk file.
  • If you save the "global-settings.xdk" file and restored it in the step above and you're still having hang troubles, try deleting the directories and files above, along with the "global-settings.xdk" file and try it again.
  • Do not store your project directories on a network share (Intel XDK currently has issues with network shares that have not yet been resolved). This includes folders shared between a Virtual machine (VM) guest and its host machine (for example, if you are running Windows* in a VM running on a Mac* host). This network share issue is a known issue with a fix request in place.
  • There have also been issues with running behind a corporate network proxy or firewall. To check them try running Intel XDK from your home network where, presumably, you have a simple NAT router and no proxy or firewall. If things work correctly there then your corporate firewall or proxy may be the source of the problem.
  • Issues with Intel XDK account logins can also cause Intel XDK to hang. To confirm that your login is working correctly, go to the Intel XDK App Center and confirm that you can login with your Intel XDK account. While you are there you might also try deleting the offending project(s) from the App Center.

If you can reliably reproduce the problem, please send us a copy of the "xdk.log" file that is stored in the same directory as the "global-settings.xdk" file to html5tools@intel.com.

Is Intel XDK an open source project? How can I contribute to the Intel XDK community?

No, it is not an open source project. However, it utilizes many open source components that are then assembled into the Intel XDK. While you cannot contribute directly to the Intel XDK integration effort, you can contribute to the many open source components that make up the Intel XDK.

The following open source components are the major elements that are being used by Intel XDK:

  • Node-Webkit
  • Chromium
  • Ripple* emulator
  • Brackets* editor
  • Weinre* remote debugger
  • Crosswalk*
  • Cordova*
  • App Framework*

How do I configure Intel XDK to use 9 patch png for Android* apps splash screen?

Intel XDK does support the use of 9 patch png for Android* apps splash screen. You can read up more at https://software.intel.com/en-us/xdk/articles/android-splash-screens-using-nine-patch-png on how to create a 9 patch png image and link to an Intel XDK sample using 9 patch png images.

How do I stop AVG from popping up the "General Behavioral Detection" window when Intel XDK is launched?

You can try adding nw.exe as the app that needs an exception in AVG.

What do I specify for "App ID" in Intel XDK under Build Settings?

Your app ID uniquely identifies your app. For example, it can be used to identify your app within Apple’s application services allowing you to use things like in-app purchasing and push notifications.

Here are some useful articles on how to create an App ID:

Is it possible to modify the Android Manifest or iOS plist file with the Intel XDK?

You cannot modify the AndroidManifest.xml file directly with our build system, as it only exists in the cloud. However, you may do so by creating a dummy plugin that only contains a plugin.xml file containing directives that can be used to add lines to the AndroidManifest.xml file during the build process. In essence, you add lines to the AndroidManifest.xml file via a local plugin.xml file. Here is an example of a plugin that does just that:

<?xml version="1.0" encoding="UTF-8"?>
<plugin xmlns="http://apache.org/cordova/ns/plugins/1.0"
    id="my-custom-intents-plugin" version="1.0.0">
  <name>My Custom Intents Plugin</name>
  <description>Add Intents to the AndroidManifest.xml</description>
  <license>MIT</license>
  <engines>
    <engine name="cordova" version=">=3.0.0" />
  </engines>
  <!-- android -->
  <platform name="android">
    <config-file target="AndroidManifest.xml" parent="/manifest/application">
      <activity android:configChanges="orientation|keyboardHidden|keyboard|screenSize|locale"
          android:label="@string/app_name" android:launchMode="singleTop"
          android:name="testa" android:theme="@android:style/Theme.Black.NoTitleBar">
        <intent-filter>
          <action android:name="android.intent.action.SEND" />
          <category android:name="android.intent.category.DEFAULT" />
          <data android:mimeType="*/*" />
        </intent-filter>
      </activity>
    </config-file>
  </platform>
</plugin>

You can inspect the AndroidManifest.xml created in an APK, using apktool with the following command line:

$ apktool d my-app.apk
$ cd my-app
$ more AndroidManifest.xml

This technique exploits the config-file element that is described in the Cordova Plugin Specification docs and can also be used to add lines to iOS plist files. See the Cordova plugin documentation link for additional details.

Here is an example of such a plugin for modifying the iOS plist file, specifically for adding a BIS key to the plist file:

<?xml version="1.0" encoding="UTF-8"?>
<plugin xmlns="http://apache.org/cordova/ns/plugins/1.0"
    id="my-custom-bis-plugin" version="0.0.2">
  <name>My Custom BIS Plugin</name>
  <description>Add BIS info to iOS plist file.</description>
  <license>BSD-3</license>
  <preference name="BIS_KEY" />
  <engines>
    <engine name="cordova" version=">=3.0.0" />
  </engines>
  <!-- ios -->
  <platform name="ios">
    <config-file target="*-Info.plist" parent="CFBundleURLTypes">
      <array>
        <dict>
          <key>ITSAppUsesNonExemptEncryption</key><true/>
          <key>ITSEncryptionExportComplianceCode</key><string>$BIS_KEY</string>
        </dict>
      </array>
    </config-file>
  </platform>
</plugin>

How can I share my Intel XDK app build?

You can send a link to your project via an email invite from your project settings page. However, a login to your account is required to access the file behind the link. Alternatively, you can download the build from the build page, onto your workstation, and push that built image to some location from which you can send a link to that image.

Why does my iOS build fail when I am able to test it successfully on a device and the emulator?

Common reasons include:

  • The App ID specified in your project settings does not match the one you specified in Apple's developer portal.
  • The provisioning profile does not match the cert you uploaded. Double check with Apple's developer site that you are using the correct and current distribution cert and that the provisioning profile is still active. Download the provisioning profile again and add it to your project to confirm.
  • In Project Build Settings, your App Name is invalid. It should be modified to include only letters, spaces, and numbers.

How do I add multiple domains in Domain Access?

Here is the primary doc source for that feature.

If you need to insert multiple domain references, then you will need to add the extra references in the intelxdk.config.additions.xml file. This StackOverflow entry provides a basic idea and you can see the intelxdk.config.*.xml files that are automatically generated with each build for the <access origin="xxx" /> line that is generated based on what you provide in the "Domain Access" field of the "Build Settings" panel on the Project Tab.

How do I build more than one app using the same Apple developer account?

On Apple developer, create a distribution certificate using the "iOS* Certificate Signing Request" key downloaded from Intel XDK Build tab only for the first app. For subsequent apps, reuse the same certificate and import this certificate into the Build tab like you usually would.

How do I include search and spotlight icons as part of my app?

Please refer to this article in the Intel XDK documentation. Create an intelxdk.config.additions.xml file in your top-level directory (same location as the other intelxdk.*.config.xml files) and add the following lines to support icons in Settings and other areas in iOS*.

<!-- Spotlight Icon -->
<icon platform="ios" src="res/ios/icon-40.png" width="40" height="40" />
<icon platform="ios" src="res/ios/icon-40@2x.png" width="80" height="80" />
<icon platform="ios" src="res/ios/icon-40@3x.png" width="120" height="120" />
<!-- iPhone Spotlight and Settings Icon -->
<icon platform="ios" src="res/ios/icon-small.png" width="29" height="29" />
<icon platform="ios" src="res/ios/icon-small@2x.png" width="58" height="58" />
<icon platform="ios" src="res/ios/icon-small@3x.png" width="87" height="87" />
<!-- iPad Spotlight and Settings Icon -->
<icon platform="ios" src="res/ios/icon-50.png" width="50" height="50" />
<icon platform="ios" src="res/ios/icon-50@2x.png" width="100" height="100" />

For more information related to these configurations, visit http://cordova.apache.org/docs/en/3.5.0/config_ref_images.md.html#Icons%20and%20Splash%20Screens.

For accurate information related to iOS icon sizes, visit https://developer.apple.com/library/ios/documentation/UserExperience/Conceptual/MobileHIG/IconMatrix.html

NOTE: The iPhone 6 icons will only be available if iOS* 7 or 8 is the target.

Cordova iOS* 8 support JIRA tracker: https://issues.apache.org/jira/browse/CB-7043

Does Intel XDK support Modbus TCP communication?

No, since Modbus is a specialized protocol, you need to write either some JavaScript* or native code (in the form of a plugin) to handle the Modbus transactions and protocol.

How do I sign an Android* app using an existing keystore?

New with release 3088 of the Intel XDK, you may now import your existing keystore into Intel XDK using the new certificate manager that is built into the Intel XDK. Please read the initial paragraphs of Managing Certificates for your Intel XDK Account and the section titled "Import an Android Certificate Keystore" in that document, for details regarding how to do this.

If the above fails, please send an email to html5tools@intel.com requesting help. It is important that you send that email from the email address associated with your Intel XDK account.

How do I build separately for different Android* versions?

Under the Projects Panel, you can select the Target Android* version under the Build Settings collapsible panel. You can change this value and build your application multiple times to create numerous versions of your application that are targeted for multiple versions of Android*.

How do I display the 'Build App Now' button if my display language is not English?

If your display language is not English and the 'Build App Now' button is proving to be troublesome, you may change your display language to English, which can be downloaded via a Windows* update. Once you have installed the English language pack, proceed to Control Panel > Clock, Language and Region > Region and Language > Change Display Language.

How do I update my Intel XDK version?

When an Intel XDK update is available, an Update Version dialog box lets you download the update. After the download completes, a similar dialog lets you install it. If you did not download or install an update when prompted (or on older versions), click the package icon next to the orange (?) icon in the upper-right to download or install the update. The installation removes the previous Intel XDK version.

How do I import my existing HTML5 app into the Intel XDK?

If your project contains an Intel XDK project file (<project-name>.xdk) you should use the "Open an Intel XDK Project" option located at the bottom of the Projects List on the Projects tab (lower left of the screen, round green "eject" icon, on the Projects tab). This would be the case if you copied an existing Intel XDK project from another system or used a tool that exported a complete Intel XDK project.

If your project does not contain an Intel XDK project file (<project-name>.xdk) you must "import" your code into a new Intel XDK project. To import your project, use the "Start a New Project" option located at the bottom of the Projects List on the Projects tab (lower left of the screen, round blue "plus" icon, on the Projects tab). This will open the "Samples, Demos and Templates" page, which includes an option to "Import Your HTML5 Code Base." Point to the root directory of your project. The Intel XDK will attempt to locate a file named index.html in your project and will set the "Source Directory" on the Projects tab to point to the directory that contains this file.

If your imported project did not contain an index.html file, your project may be unstable. In that case, it is best to delete the imported project from the Intel XDK Projects tab ("x" icon in the upper right corner of the screen), rename your "root" or "main" HTML file to index.html, and import the project again. Several components in the Intel XDK depend on the assumption that the main HTML file in your project is named index.html. See Introducing Intel® XDK Development Tools for more details.

It is highly recommended that your "source directory" be located as a sub-directory inside your "project directory." This ensures that non-source files are not included as part of your build package when building your application. If the "source directory" and "project directory" are the same, the result is longer upload times to the build server and unnecessarily large application executable files returned by the build system. See the following images for the recommended project file layout.
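A minimal sketch of the recommended layout (all names other than the .xdk and intelxdk.config.*.xml files are examples):

my-project/                      <- project directory: <project-name>.xdk, intelxdk.config.*.xml
    www/                         <- source directory, set as "Source Directory" on the Projects tab
        index.html
        css/
        js/
        images/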

I am unable to login to App Preview with my Intel XDK password.

On some devices you may have trouble entering your Intel XDK login password directly on the device in the App Preview login screen. In particular, sometimes you may have trouble with the first one or two letters getting lost when entering your password.

Try the following if you are having such difficulties:

  • Reset your password, using the Intel XDK, to something short and simple.

  • Confirm that this new short and simple password works with the XDK (logout and login to the Intel XDK).

  • Confirm that this new password works with the Intel Developer Zone login.

  • Make sure you have the most recent version of Intel App Preview installed on your devices. Go to the store on each device to confirm you have the most recent copy of App Preview installed.

  • Try logging into Intel App Preview on each device with this short and simple password. Check the "show password" box so you can see your password as you type it.

If the above works, it confirms that you can log into your Intel XDK account from App Preview (because App Preview and the Intel XDK go to the same place to authenticate your login). When the above works, you can go back to the Intel XDK and reset your password to something else, if you do not like the short and simple password you used for the test.

If you are having trouble logging into any pages on the Intel web site (including the Intel XDK forum), please see the Intel Sign In FAQ for suggestions and contact info. That login system is the backend for the Intel XDK login screen.

How do I completely uninstall the Intel XDK from my system?

Take the following steps to completely uninstall the XDK from your Windows system:

  • From the Windows Control Panel, remove the Intel XDK, using the Windows uninstall tool.

  • Then:
    > cd %LocalAppData%\Intel\XDK
    > del *.* /s/q

  • Then:
    > cd %LocalAppData%\XDK
    > copy global-settings.xdk %UserProfile%
    > del *.* /s/q
    > copy %UserProfile%\global-settings.xdk .

  • Then:
    -- Go to xdk.intel.com and select the download link.
    -- Download and install the new XDK.

To do the same on a Linux or Mac system:

  • On a Linux machine, run the uninstall script, typically /opt/intel/XDK/uninstall.sh.
     
  • Remove the directory into which the Intel XDK was installed.
    -- Typically /opt/intel or your home (~) directory on a Linux machine.
    -- Typically in the /Applications/Intel XDK.app directory on a Mac.
     
  • Then:
    $ find ~ -name global-settings.xdk
    $ cd <result-from-above> (for example ~/Library/Application Support/XDK/ on a Mac)
    $ cp global-settings.xdk ~
    $ rm -Rf *
    $ mv ~/global-settings.xdk .

     
  • Then:
    -- Go to xdk.intel.com and select the download link.
    -- Download and install the new XDK.

Is there a tool that can help me highlight syntax issues in Intel XDK?

Yes, you can use the various linting tools that can be added to the Brackets editor to review any syntax issues in your HTML, CSS and JS files. Go to the "File > Extension Manager..." menu item and add the following extensions: JSHint, CSSLint, HTMLHint, XLint for Intel XDK. Then, review your source files by monitoring the small yellow triangle at the bottom of the edit window (a green check mark indicates no issues).

How do I delete built apps and test apps from the Intel XDK build servers?

You can manage them by logging into: https://appcenter.html5tools-software.intel.com/csd/controlpanel.aspx. This functionality will eventually be available within the Intel XDK, after which access to App Center will be removed.

I need help with the App Security API plugin; where do I find it?

Visit the primary documentation book for the App Security API and see this forum post for some additional details.

When I install my app or use the Debug tab Avast antivirus flags a possible virus, why?

If you are receiving a "Suspicious file detected - APK:CloudRep [Susp]" message from the Avast anti-virus software installed on your Android device, it is because you are side-loading the app (or the Intel XDK Debug modules) onto your device (using a download link after building, or by using the Debug tab to debug your app), or because your app was installed from an "untrusted" Android store. See the following official explanation from Avast:

Your application was flagged by our cloud reputation system. "Cloud rep" is a new feature of Avast Mobile Security, which flags apks when the following conditions are met:

  1. The file is not prevalent enough; meaning not enough users of Avast Mobile Security have installed your APK.
  2. The source is not an established market (Google Play is an example of an established market).

If you distribute your app using Google Play (or any other trusted market) your users should not see any warning from Avast.

Following are some of the Avast anti-virus notification screens you might see on your device. All of them are perfectly normal: you must enable the installation of "non-market" apps in order to use your device for debugging, and the App IDs associated with your never-published app (or with the custom debug modules that the Debug tab in the Intel XDK builds and installs on your device) will not be found in an "established" (aka "trusted") market, such as Google Play.

If you choose to ignore the "Suspicious app activity!" threat you will not receive a threat for that debug module any longer. It will show up in the Avast 'ignored issues' list. Updates to an existing, ignored, custom debug module should continue to be ignored by Avast. However, new custom debug modules (due to a new project App ID or a new version of Crosswalk selected in your project's Build Settings) will result in a new warning from the Avast anti-virus tool.


How do I add a Brackets extension to the editor that is part of the Intel XDK?

The set of Brackets extensions provided in the built-in edition of the Brackets editor is limited, to ensure the stability of the Intel XDK product. Not all extensions are compatible with the edition of Brackets that is embedded within the Intel XDK, and adding incompatible extensions can cause the Intel XDK to stop working.

Despite this warning, there are useful extensions that have not been included in the editor and which can be added to the Intel XDK. Adding them is temporary: each time you update the Intel XDK (or if you reinstall the Intel XDK) you will have to "re-add" your Brackets extensions. To add a Brackets extension, use the following procedure:

  • exit the Intel XDK
  • download a ZIP file of the extension you wish to add
  • on Windows, unzip the extension here:
    %LocalAppData%\Intel\XDK\xdk\brackets\b\extensions\dev
  • on Mac OS X, unzip the extension here:
    /Applications/Intel\ XDK.app/Contents/Resources/app.nw/brackets/b/extensions/dev
  • start the Intel XDK

Note that the locations given above are subject to change with new releases of the Intel XDK.

Why does my app or game require so many permissions on Android when built with the Intel XDK?

When you build your HTML5 app using the Intel XDK for Android or Android-Crosswalk you are creating a Cordova app. It may seem like you're not building a Cordova app, but you are. In order to package your app so it can be distributed via an Android store and installed on an Android device, it needs to be built as a hybrid app. The Intel XDK uses Cordova to create that hybrid app.

A pure Cordova app requires the NETWORK permission; it is needed to "jump" between your HTML5 environment and the native Android environment. Additional permissions will be added by any Cordova plugins you include with your application; which permissions are added is a function of what each plugin does and requires.

Crosswalk for Android builds also require the NETWORK permission, because the Crosswalk image built by the Intel XDK includes support for Cordova. In addition, current versions of Crosswalk (12 and 14 at the time this FAQ was written) also require the NETWORK STATE and WIFI STATE permissions. There is an extra permission in some versions of Crosswalk (WRITE EXTERNAL STORAGE) that is needed only by the shared model library of Crosswalk; we have asked the Crosswalk project to remove this permission in a future Crosswalk version.

If you are seeing more than the following permissions in your XDK-built Crosswalk app:

  • android.permission.INTERNET
  • android.permission.ACCESS_NETWORK_STATE
  • android.permission.ACCESS_WIFI_STATE
  • android.permission.WRITE_EXTERNAL_STORAGE

then you are seeing permissions that have been added by some plugins. Each plugin is different, so there is no hard rule of thumb. The two "default" core Cordova plugins that are added by the Intel XDK blank templates (device and splash screen) do not require any Android permissions.

BTW: the permission list above comes from a Crosswalk 14 build. Crosswalk 12 builds do not include the last permission; it was added when the Crosswalk project introduced the shared model library option, which started with Crosswalk 13 (the Intel XDK does not support Crosswalk 13 builds).

How do I make a copy of an existing Intel XDK project?

If you just need to make a backup copy of an existing project, and do not plan to open that backup copy as a project in the Intel XDK, do the following:

  • Exit the Intel XDK.
  • Copy the entire project directory:
    • on Windows, use File Explorer to "right-click" and "copy" your project directory, then "right-click" and "paste"
    • on Mac use Finder to "right-click" and then "duplicate" your project directory
    • on Linux, open a terminal window, "cd" to the folder that contains your project, and type "cp -a old-project/ new-project/" at the terminal prompt (where "old-project/" is the folder name of your existing project that you want to copy and "new-project/" is the name of the new folder that will contain a copy of your existing project)

If you want to use an existing project as the starting point of a new project in the Intel XDK, follow the process described below. It ensures that the build system does not confuse the ID in your old project with the one stored in your new project. If you do not follow the procedure below you will have multiple projects using the same project ID (a special GUID that is stored inside the Intel XDK <project-name>.xdk file in the root directory of your project). Each project in your account must have a unique project ID.

  • Exit the Intel XDK.
  • Make a copy of your existing project using the process described above.
  • Inside the new project that you made (that is, your new copy of your old project), make copies of the <project-name>.xdk file and <project-name>.xdke files and rename those copies to something like project-new.xdk and project-new.xdke (anything you like, just something different than the original project name, preferably the same name as the new project folder in which you are making this new project).
  • Using a TEXT EDITOR (only) (such as Notepad or Sublime or Brackets or some other TEXT editor), open your new "project-new.xdk" file (whatever you named it) and find the projectGuid line; it will look something like this:
    "projectGuid": "a863c382-ca05-4aa4-8601-375f9f209b67",
  • Change the "GUID" to all zeroes, like this: "00000000-0000-0000-000000000000"
  • Save the modified "project-new.xdk" file.
  • Open the Intel XDK.
  • Go to the Projects tab.
  • Select "Open an Intel XDK Project" (the green button at the bottom left of the Projects tab).
  • To open this new project, locate the new "project-new.xdk" file inside the new project folder you copied above.
  • Don't forget to change the App ID in your new project. This is necessary to avoid conflicts with the project you copied from, in the store and when side-loading onto a device.
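If you are comfortable with the command line, a scripted way to zero the GUID on a Mac or Linux system (the file name project-new.xdk is the example name used above; a backup copy is kept as project-new.xdk.bak):

$ sed -i.bak 's/"projectGuid": "[0-9a-fA-F-]*"/"projectGuid": "00000000-0000-0000-0000-000000000000"/' project-new.xdk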

My project does not include a www folder. How do I fix it so it includes a www or source directory?

The Intel XDK HTML5 and Cordova project file structures are meant to mimic a standard Cordova project. In a Cordova (or PhoneGap) project there is a subdirectory (or folder) named www that contains all of the HTML5 source code and asset files that make up your application. For best results, it is advised that you follow this convention, of putting your source inside a "source directory" inside of your project folder.

This most commonly happens as the result of exporting a project from an external tool, such as Construct2, or as the result of importing an existing HTML5 web app that you are converting into a hybrid mobile application (e.g., an Intel XDK Cordova app). If you would like to convert an existing Intel XDK project into this format, follow the steps below (a condensed shell sketch follows the list):

  • Exit the Intel XDK.
  • Copy the entire project directory:
    • on Windows, use File Explorer to "right-click" and "copy" your project directory, then "right-click" and "paste"
    • on Mac use Finder to "right-click" and then "duplicate" your project directory
    • on Linux, open a terminal window, "cd" to the folder that contains your project, and type "cp -a old-project/ new-project/" at the terminal prompt (where "old-project/" is the folder name of your existing project that you want to copy and "new-project/" is the name of the new folder that will contain a copy of your existing project)
  • Create a "www" directory inside the new duplicate project you just created above.
  • Move your index.html and other source and asset files to the "www" directory you just created -- this is now your "source" directory, located inside your "project" directory (do not move the <project-name>.xdk and xdke files and any intelxdk.config.*.xml files, those must stay in the root of the project directory)
  • Inside the new project that you made above (by making a copy of the old project), rename the <project-name>.xdk file and <project-name>.xdke files to something like project-copy.xdk and project-copy.xdke (anything you like, just something different than the original project, preferably the same name as the new project folder in which you are making this new project).
  • Using a TEXT EDITOR (only) (such as Notepad or Sublime or Brackets or some other TEXT editor), open the new "project-copy.xdk" file (whatever you named it) and find the projectGuid line; it will look something like this:
    "projectGuid": "a863c382-ca05-4aa4-8601-375f9f209b67",
  • Change the "GUID" to all zeroes, like this: "00000000-0000-0000-000000000000"
  • A few lines down find: "sourceDirectory": "",
  • Change it to this: "sourceDirectory": "www",
  • Save the modified "project-copy.xdk" file.
  • Open the Intel XDK.
  • Go to the Projects tab.
  • Select "Open an Intel XDK Project" (the green button at the bottom left of the Projects tab).
  • To open this new project, locate the new "project-copy.xdk" file inside the new project folder you copied above.
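For reference, here is a condensed shell sketch of the steps above for a Mac or Linux system (all file and folder names are examples; adjust them to match your project, and perform the .xdk edits with a text editor if you prefer):

$ cp -a old-project/ new-project/
$ cd new-project
$ mkdir www
$ mv index.html css/ js/ images/ www/        # move source and asset files only (example names)
$ mv old-project.xdk project-copy.xdk
$ mv old-project.xdke project-copy.xdke

Then edit project-copy.xdk as described above to zero the projectGuid and set "sourceDirectory": "www".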

Can I install more than one copy of the Intel XDK onto my development system?

Yes, you can install more than one version onto your development system. However, you cannot run multiple instances of the Intel XDK at the same time. Be aware that new releases sometimes change the project file format, so it is a good idea, in these cases, to make a copy of your project if you need to experiment with a different version of the Intel XDK. See the instructions in a FAQ entry above regarding how to make a copy of your Intel XDK project.

Follow the instructions in this forum post to install more than one copy of the Intel XDK onto your development system.

On Apple OS X* and Linux* systems, does the Intel XDK need the OpenSSL* library installed?

Yes. Several features of the Intel XDK require the OpenSSL library, which typically comes pre-installed on Linux and OS X systems. If the Intel XDK reports that it could not find libssl, go to https://www.openssl.org to download and install it.
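If you are not sure whether the library is present, a quick check from a terminal (a sketch; exact package names and paths vary by distribution):

$ openssl version                 # confirms the OpenSSL tools are installed
$ ldconfig -p | grep libssl       # Linux only: lists the installed libssl shared libraries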

I have a web application that I would like to distribute in app stores without major modifications. Is this possible using the Intel XDK?

Yes, if you have a true web app or “client app” that only uses HTML, CSS and JavaScript, it is usually not too difficult to convert it to a Cordova hybrid application (this is what the Intel XDK builds when you create an HTML5 app). If you rely heavily on PHP or other server scripting languages embedded in your pages you will have more work to do. Because your Cordova app is not associated with a server, you cannot rely on server-based programming techniques; instead, you must rewrite any such code to use RESTful APIs that your app interacts with using, for example, AJAX calls.

What is the best training approach to using the Intel XDK for a newbie?

First, become well-versed in the art of client web apps, apps that rely only on HTML, CSS and JavaScript and utilize RESTful APIs to talk to network services. With that you will have mastered 80% of the problem. After that, it is simply a matter of understanding how Cordova plugins are able to extend the JavaScript API for access to features of the platform. For HTML5 training there are many sites providing tutorials. It may also help to read Five Useful Tips on Getting Started Building Cordova Mobile Apps with the Intel XDK, which will help you understand some of the differences between developing for a traditional server-based environment and developing for the Intel XDK hybrid Cordova app environment.

What is the best platform to start building an app with the Intel XDK? And what are the important differences between the Android, iOS and other mobile platforms?

There is no one most important difference between the Android, iOS and other platforms. It is important to understand that the HTML5 runtime engine that executes your app on each platform will vary as a function of the platform. Just as there are differences between Chrome and Firefox and Safari and Internet Explorer, there are differences between iOS 9 and iOS 8 and Android 4 and Android 5, etc. Android has the most significant differences between vendors and versions of Android. This is one of the reasons the Intel XDK offers the Crosswalk for Android build option, to normalize and update the Android issues.

In general, if you can get your app working well on Android (or Crosswalk for Android) first you will generally have fewer issues to deal with when you start to work on the iOS and Windows platforms. In addition, the Android platform has the most flexible and useful debug options available, so it is the easiest platform to use for debugging and testing your app.

Is my password encrypted and why is it limited to fifteen characters?

Yes, your password is stored encrypted and is managed by https://signin.intel.com. Your Intel XDK userid and password can also be used to log into the Intel XDK forum as well as the Intel Developer Zone. The Intel XDK itself does not store or manage your userid and password.

The rules regarding allowed userids and passwords are answered on this Sign In FAQ page, where you can also find help on recovering and changing your password.

Why does the Intel XDK take a long time to start on Linux or Mac?

...and why am I getting this error message? "Attempt to contact authentication server is taking a long time. You can wait, or check your network connection and try again."

At startup, the Intel XDK attempts to automatically determine the proxy settings for your machine. Unfortunately, on some system configurations it is unable to reliably detect your system proxy settings. As an example, you might see something like this image when starting the Intel XDK.

On some systems you can get around this problem by setting some proxy environment variables and then starting the Intel XDK from a command line that includes those configured environment variables. To set those environment variables, use commands similar to the following:

$ export no_proxy="localhost,127.0.0.1/8,::1"
$ export NO_PROXY="localhost,127.0.0.1/8,::1"
$ export http_proxy=http://proxy.mydomain.com:123/
$ export HTTP_PROXY=http://proxy.mydomain.com:123/
$ export https_proxy=http://proxy.mydomain.com:123/
$ export HTTPS_PROXY=http://proxy.mydomain.com:123/

IMPORTANT! The name of your proxy server and the port (or ports) that your proxy server requires will be different than those shown in the example above. Please consult with your IT department to find out what values are appropriate for your site. Intel has no way of knowing what configuration is appropriate for your network.

If you use the Intel XDK in multiple locations (at work and at home), you may have to change the proxy settings before starting the Intel XDK after switching to a new network location. For example, many work networks use a proxy server, but most home networks do not require such a configuration. In that case, you need to be sure to "unset" the proxy environment variables before starting the Intel XDK on a non-proxy network.
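For example, to clear the proxy variables shown above before starting the Intel XDK on a home network:

$ unset no_proxy NO_PROXY http_proxy HTTP_PROXY https_proxy HTTPS_PROXY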

After you have successfully configured your proxy environment variables, you can start the Intel XDK manually, from the command-line.

On a Mac, where the Intel XDK is installed in the default location, type the following (from a terminal window that has the above environment variables set):

$ open /Applications/Intel\ XDK.app/

On a Linux machine, assuming the Intel XDK has been installed in the ~/intel/XDK directory, type the following (from a terminal window that has the above environment variables set):

$ ~/intel/XDK/xdk.sh &

In the Linux case, you will need to adjust the directory name that points to the xdk.sh file in order to start. The example above assumes a local install into the ~/intel/XDK directory. Since Linux installations have more options regarding the installation directory, you will need to adjust the above to suit your particular system and install directory.

How do I generate a P12 file on a Windows machine?

See these articles:
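As a general sketch, if you already have your certificate and private key in PEM format, the OpenSSL tools (which are also available for Windows) can combine them into a P12 file; the file names below are examples only:

> openssl pkcs12 -export -in developer_certificate.pem -inkey private_key.pem -out developer_identity.p12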

How do I change the default dir for creating new projects in the Intel XDK?

You can change the default new project location manually by modifying a field in the global-settings.xdk file. Locate the global-settings.xdk file on your system (the precise location varies as a function of the OS) and find this JSON object inside that file:

"projects-tab": {"defaultPath": "/Users/paul/Documents/XDK","LastSortType": "descending|Name","lastSortType": "descending|Opened","thirdPartyDisclaimerAcked": true
  },

The example above came from a Mac. On a Mac the global-settings.xdk file is located in the "~/Library/Application Support/XDK" directory.

On a Windows machine the global-settings.xdk file is normally found in the "%LocalAppData%\XDK" directory. The part you are looking for will look something like this:

"projects-tab": {"thirdPartyDisclaimerAcked": false,"LastSortType": "descending|Name","lastSortType": "descending|Opened","defaultPath": "C:\\Users\\paul/Documents"
  },

Obviously, it's the defaultPath part you want to change.

BE CAREFUL WHEN YOU EDIT THE GLOBAL-SETTINGS.XDK FILE!! You've been warned...

Make sure the result is proper JSON when you are done, or it may cause your XDK to cough and hack loudly. Make a backup copy of global-settings.xdk before you start, just in case.
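One quick way to verify that the file is still valid JSON after editing (a sketch, assuming Python is installed; on Windows redirect to NUL instead of /dev/null):

$ python -m json.tool global-settings.xdk > /dev/null && echo OK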

Where can I find a list of recent and upcoming webinars?

How can I change the email address associated with my Intel XDK login?

Login to the Intel Developer Zone with your Intel XDK account userid and password and then locate your "account dashboard." Click the "pencil icon" next to your name to open the "Personal Profile" section of your account, where you can edit your "Name & Contact Info," including the email address associated with your account, under the "Private" section of your profile.

What network addresses must I enable in my firewall to ensure the Intel XDK will work on my restricted network?

Normally, access to the external servers that the Intel XDK uses is handled automatically by your proxy server. However, if you are working in an environment that has restricted Internet access and you need to provide your IT department with a list of URLs that you need access to in order to use the Intel XDK, then please provide them with the following list of domain names:

  • appcenter.html5tools-software.intel.com (for communication with the build servers)
  • s3.amazonaws.com (for downloading sample apps and built apps)
  • download.xdk.intel.com (for getting XDK updates)
  • debug-software.intel.com (for using the Test tab weinre debug feature)
  • xdk-feed-proxy.html5tools-software.intel.com (for receiving the tweets in the upper right corner of the XDK)
  • signin.intel.com (for logging into the XDK)
  • sfederation.intel.com (for logging into the XDK)

Normally this should be handled by your network proxy (if you're on a corporate network) or should not be an issue if you are working on a typical home network.

I cannot create a login for the Intel XDK, how do I create a userid and password to use the Intel XDK?

If you have downloaded and installed the Intel XDK but are having trouble creating a login, you can create the login outside the Intel XDK. To do this, go to the Intel Developer Zone and push the "Join Today" button. After you have created your Intel Developer Zone login you can return to the Intel XDK and use that userid and password to login to the Intel XDK. This same userid and password can also be used to login to the Intel XDK forum.

Installing the Intel XDK on Windows fails with a "Package signature verification failed." message.

If you receive a "Package signature verification failed" message (see image below) when installing the Intel XDK on your system, it is likely due to one of the following two reasons:

  • Your system does not have a properly installed "root certificate" file, which is needed to confirm that the install package is good.
  • The install package is corrupt and failed the verification step.

The first case can happen if you are attempting to install the Intel XDK on an unsupported version of Windows. The Intel XDK is only supported on Microsoft Windows 7 and higher. If you attempt to install on Windows Vista (or earlier) you may see this verification error. The workaround is to install the Intel XDK on a Windows 7 or greater machine.

The second case is likely due to a corruption of the install package during download or due to tampering. The workaround is to re-download the install package and attempt another install.

If you are installing on a Windows 7 (or greater) machine and you see this message it is likely due to a missing or bad root certificate on your system. To fix this you may need to start the "Certificate Propagation" service. Open the Windows "services.msc" panel and then start the "Certificate Propagation" service. Additional links related to this problem can be found here > https://technet.microsoft.com/en-us/library/cc754841.aspx

See this forum thread for additional help regarding this issue > https://software.intel.com/en-us/forums/intel-xdk/topic/603992

Trouble installing the Intel XDK on a Linux or Ubuntu system; which option should I choose?

Choose the local user option, not root or sudo, when installing the Intel XDK on your Linux or Ubuntu system. This is the most reliable and trouble-free option, and it is the default installation option. It ensures that the Intel XDK has all the permissions necessary to execute properly on your Linux system. The Intel XDK will be installed in a subdirectory of your home (~) directory.

Inactive account / login issue / problem updating an APK in the store: how do I request an account transfer?

As of June 26, 2015 we migrated all Intel XDK accounts to the more secure intel.com login system (the same login system you use to access this forum).

We have migrated nearly all active users to the new login system. Unfortunately, there are a few active user accounts that we could not automatically migrate to intel.com, primarily because the intel.com login system does not allow the use of some characters in userids that were allowed in the old login system.

If you had not used the Intel XDK for a long time prior to June 2015, your account may not have been automatically migrated. If you own an "inactive" account it will have to be manually migrated; try logging into the Intel XDK with your old userid and password to determine whether it still works. If you find that you cannot log into your existing Intel XDK account, and still need access to it, please send a message to html5tools@intel.com and include your userid and the email address associated with that userid, so we can guide you through the steps required to reactivate your old account.

Alternatively, you can create a new Intel XDK account. If you have submitted an app to the Android store from your old account you will need access to that old account to retrieve the Android signing certificates in order to upgrade that app on the Android store; in that case, send an email to html5tools@intel.com with your old account username and email and new account information.

Connection Problems? -- Intel XDK SSL certificates update

On January 26, 2016 we updated the SSL certificates on our back-end systems to SHA2 certificates. The existing certificates were due to expire in February of 2016. We have also disabled support for obsolete protocols.

If you are experiencing persistent connection issues (since Jan 26, 2016), please post a problem report on the forum and include in your problem report:

  • the operation that failed
  • the version of your XDK
  • the version of your operating system
  • your geographic region
  • and a screen capture

How do I resolve build failure: "libpng error: Not a PNG file"?  

If you are experiencing build failures with CLI 5 Android builds, and the detailed error log includes a message similar to the following:

Execution failed for task ':mergeArmv7ReleaseResources'.
> Error: Failed to run command:
  /Developer/android-sdk-linux/build-tools/22.0.1/aapt s -i .../platforms/android/res/drawable-land-hdpi/screen.png -o .../platforms/android/build/intermediates/res/armv7/release/drawable-land-hdpi-v4/screen.png

Error Code: 42

Output: libpng error: Not a PNG file

You need to change the format of your icon and/or splash screen images to PNG format.

The error message refers to a file named "screen.png" -- which is what each of your splash screens was renamed to before being moved into the build project resource directories. Unfortunately, JPG images were supplied as splash screen images rather than PNG images, so the renamed files were found by the build system to be invalid.

Convert your splash screen images to PNG format. Renaming JPG images to PNG will not work! You must convert your JPG images into PNG format images using an appropriate image editing tool. The Intel XDK does not provide any such conversion tool.

Beginning with Cordova CLI 5, all icons and splash screen images must be supplied in PNG format. This applies to all supported platforms. This is an undocumented "new feature" of the Cordova CLI 5 build system that was implemented by the Apache Cordova project.

Why do I get a "Parse Error" when I try to install my built APK on my Android device?

Because you have built an "unsigned" Android APK. You must click the "signed" box in the Android Build Settings section of the Projects tab if you want to install an APK on your device. The only reason you would choose to create an "unsigned" APK is if you need to sign it manually. This is very rare and not the normal situation.

My converted legacy keystore does not work. Google Play is rejecting my updated app.

The keystore you converted when you updated to 3088 (now 3240 or later) is the same keystore you were using in 2893. When you upgraded to 3088 (or later) and "converted" your legacy keystore, you re-signed and renamed your legacy keystore and it was transferred into a database to be used with the Intel XDK certificate management tool. It is still the same keystore, but with an alias name and password assigned by you and accessible directly by you through the Intel XDK.

If you kept the converted legacy keystore in your account following the conversion you can download that keystore from the Intel XDK for safe keeping (do not delete it from your account or from your system). Make sure you keep track of the new password(s) you assigned to the converted keystore.

There are two problems we have experienced with converted legacy keystores at the time of the 3088 release (April, 2016):

  • Foreign (non-ASCII) characters used in the new alias name and passwords were being corrupted.
  • Final signing of your APK by the build system was being done with RSA256 rather than SHA1.

Both of the above items have been resolved and should no longer be an issue.

If you are currently unable to complete a build with your converted legacy keystore (i.e., builds fail when you use the converted legacy keystore but they succeed when you use a new keystore) the first bullet above is likely the reason your converted keystore is not working. In that case we can reset your converted keystore and give you the option to convert it again. You do this by requesting that your legacy keystore be "reset" by filling out this form. For 100% surety during that second conversion, use only 7-bit ASCII characters in the alias name you assign and for the password(s) you assign.

IMPORTANT: using the legacy certificate to build your Android app is ONLY necessary if you have already published an app to an Android store and need to update that app. If you have never published an app to an Android store using the legacy certificate you do not need to concern yourself with resetting and reconverting your legacy keystore. It is easier, in that case, to create a new Android keystore and use that new keystore.

If you ARE able to successfully build your app with the converted legacy keystore, but your updated app (in the Google store) does not install on some older Android 4.x devices (typically a subset of Android 4.0-4.2 devices), the second bullet cited above is likely the reason for the problem. The solution, in that case, is to rebuild your app and resubmit it to the store (that problem was a build-system problem that has been resolved).

How can I have others beta test my app using Intel App Preview?

Apps that you sync to your Intel XDK account, using the Test tab's green "Push Files" button, can only be accessed by logging into Intel App Preview with the same Intel XDK account credentials that you used to push the files to the cloud. In other words, you can only download and run your app for testing with Intel App Preview if you log into the same account that you used to upload that test app. This restriction applies to downloading your app into Intel App Preview via the "Server Apps" tab, at the bottom of the Intel App Preview screen, or by scanning the QR code displayed on the Intel XDK Test tab using the camera icon in the upper right corner of Intel App Preview.

If you want to allow others to test your app, using Intel App Preview, it means you must use one of two options:

  • give them your Intel XDK userid and password
  • create an Intel XDK "test account" and provide your testers with that userid and password

For security's sake, we highly recommend you use the second option (create an Intel XDK "test account").

A "test account" is simply a second Intel XDK account that you do not plan to use for development or builds. Do not use the same email address for your "test account" as you are using for your main development account. You should use a "throw away" email address for that "test account" (an email address that you do not care about).

Assuming you have created an Intel XDK "test account" and have instructed your testers to download and install Intel App Preview; have provided them with your "test account" userid and password; and you are ready to have them test:

  • sign out of your Intel XDK "development account" (using the little "man" icon in the upper right)
  • sign into your "test account" (again, using the little "man" icon in the Intel XDK toolbar)
  • make sure you have selected the project that you want users to test, on the Projects tab
  • go to the Test tab
  • make sure "MOBILE" is selected (upper left of the Test tab)
  • push the green "PUSH FILES" button on the Test tab
  • log out of your "test account"
  • log into your development account

Then, tell your beta testers to log into Intel App Preview with your "test account" credentials and instruct them to choose the "Server Apps" tab at the bottom of the Intel App Preview screen. From there they should see the name of the app you synced using the Test tab and can simply start it by touching the app name (followed by the big blue and white "Launch This App" button). Starting the app this way is actually easier than sending them a copy of the QR code. The QR code is very dense and can be hard to read with some devices, depending on the quality of the camera in their device.

Note that when running your test app inside of Intel App Preview your testers cannot exercise any features associated with third-party plugins, only core Cordova plugins. Thus, you need to ensure that those parts of your app that depend on non-core Cordova plugins have been disabled or have exception handlers to prevent your app from crashing or freezing.

I'm having trouble making Google Maps work with my Intel XDK app. What can I do?

There are many reasons that can cause your attempt to use Google Maps to fail. Mostly it is due to the fact that you need to download the Google Maps API (JavaScript library) at runtime to make things work. However, there is no guarantee that you will have a good network connection, so if you do it the way you are used to doing it, in a browser...

<script src="https://maps.googleapis.com/maps/api/js?key=API_KEY&sensor=true"></script>

...you may get yourself into trouble, in an Intel XDK Cordova app. See Loading Google Maps in Cordova the Right Way for an excellent tutorial on why this is a problem and how to deal with it. Also, it may help to read Five Useful Tips on Getting Started Building Cordova Mobile Apps with the Intel XDK, especially item #3, to get a better understanding of why you shouldn't use the "browser technique" you're familiar with.

An alternative is to use a mapping tool that allows you to include the JavaScript directly in your app, rather than downloading it over the network each time your app starts. Several Intel XDK developers have reported very good luck with the open-source JavaScript library named LeafletJS, which uses OpenStreetMap as its map data source.

You can also search the Cordova Plugin Database for Cordova plugins that implement mapping features, in some cases using native SDKs and libraries.

How do I fix "Cannot find the Intel XDK. Make sure your device and intel XDK are on the same wireless network." error messages?

You can either disable your firewall or allow access through the firewall for the Intel XDK. To allow access through the Windows firewall, go to the Windows Control Panel and search for the firewall (Control Panel > System and Security > Windows Firewall > Allowed Apps) and enable Node Webkit (nw or nw.exe) through the firewall.

See the image below (this image is from a Windows 8.1 system).

Google Services needs my SHA1 fingerprint. Where do I get my app's SHA fingerprint?

Your app's SHA fingerprint is part of your build signing certificate. Specifically, it is part of the signing certificate that you used to build your app. The Intel XDK provides a way to download your build certificates directly from within the Intel XDK application (see the Intel XDK documentation for details on how to manage your build certificates). Once you have downloaded your build certificate you can use these instructions provided by Google to extract the fingerprint, or simply search the Internet for "extract fingerprint from android build certificate" to find many articles detailing this process.
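For example, after downloading your Android keystore from the Intel XDK, the Java keytool utility (included with the JDK) will print the certificate fingerprints, including SHA1; the keystore file name and alias below are examples only:

$ keytool -list -v -keystore my-release.keystore -alias my-alias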

Why am I unable to test or build or connect to the old build server with Intel XDK version 2893?

This is an Important Note Regarding the use of Intel XDK Versions 2893 and Older!!

As of June 13, 2016, versions of the Intel XDK released prior to March 2016 (2893 and older) can no longer use the Build tab, the Test tab or Intel App Preview; and can no longer create custom debug modules for use with the Debug and Profile tabs. This change was necessary to improve the security and performance of our Intel XDK cloud-based build system. If you are using version 2893 or older, of the Intel XDK, you must upgrade to version 3088 or greater to continue to develop, debug and build Intel XDK Cordova apps.

The error message you see below, "NOTICE: Internet Connection and Login Required," when trying to use the Build tab is due to the fact that the cloud-based component used by those older versions of the Intel XDK has been retired and is no longer present. The error message appears to be misleading, but it is the easiest way to identify this condition.

How do I run the Intel XDK on Fedora Linux?

See the instructions below, copied from this forum post:

$ sudo find xdk/install/dir -name libudev.so.0
$ cd dir/found/above
$ sudo rm libudev.so.0
$ sudo ln -s /lib64/libudev.so.1 libudev.so.0

Note the "xdk/install/dir" is the name of the directory where you installed the Intel XDK. This might be "/opt/intel/xdk" or "~/intel/xdk" or something similar. Since the Linux install is flexible regarding the precise installation location you may have to search to find it on your system.

Once you find that libudev.so file in the Intel XDK install directory you must "cd" to that directory to finish the operations as written above.

Additional instructions have been provided in the related forum thread; please see that thread for the latest information regarding hints on how to make the Intel XDK run on a Fedora Linux system.

The Intel XDK generates a path error for my launch icons and splash screen files.

If you have an older project (created prior to August of 2016 using a version of the Intel XDK older than 3491) you may be seeing a build error indicating that some icon and/or splash screen image files cannot be found. This is likely due to the fact that some of your icon and/or splash screen image files are located within your source folder (typically named "www") rather than in the new package-assets folder. For example, inspecting one of the auto-generated intelxdk.config.*.xml files you might find something like the following:

<icon platform="windows" src="images/launchIcon_24.png" width="24" height="24"/><icon platform="windows" src="images/launchIcon_434x210.png" width="434" height="210"/><icon platform="windows" src="images/launchIcon_744x360.png" width="744" height="360"/><icon platform="windows" src="package-assets/ic_launch_50.png" width="50" height="50"/><icon platform="windows" src="package-assets/ic_launch_150.png" width="150" height="150"/><icon platform="windows" src="package-assets/ic_launch_44.png" width="44" height="44"/>

where the first three images are not being found by the build system because they are located in the "www" folder, and the last three are being found because they are located in the "package-assets" folder.

This problem usually comes about because the UI does not include the appropriate "slots" to hold those images. This results in some "dead" icon or splash screen images inside the <project-name>.xdk file which need to be removed. To fix this, make a backup copy of your <project-name>.xdk file and then, using a CODE or TEXT editor (e.g., Notepad++ or Brackets or Sublime Text or vi, etc.), edit your <project-name>.xdk file in the root of your project folder.

Inside of your <project-name>.xdk file you will find entries that look like this:

"icons_": [
  {"relPath": "images/launchIcon_24.png","width": 24,"height": 24
  },
  {"relPath": "images/launchIcon_434x210.png","width": 434,"height": 210
  },
  {"relPath": "images/launchIcon_744x360.png","width": 744,"height": 360
  },

Find all the entries that are pointing to the problem files and remove those problem entries from your <project-name>.xdk file. Obviously, you need to do this when the XDK is closed and only after you have made a backup copy of your <project-name>.xdk file, just in case you end up with a missing comma. The <project-name>.xdk file is a JSON file and needs to be in proper JSON format after you make changes or it will not be read properly by the XDK when you open it.

Then move your problem icons and splash screen images to the package-assets folder and reference them from there. Use this technique (below) to add additional icons by using the intelxdk.config.additions.xml file.

<!-- alternate way to add icons to Cordova builds, rather than using XDK GUI -->
<!-- especially for adding icon resolutions that are not covered by the XDK GUI -->
<!-- Android icons and splash screens -->
<platform name="android">
  <icon src="package-assets/android/icon-ldpi.png" density="ldpi" width="36" height="36" />
  <icon src="package-assets/android/icon-mdpi.png" density="mdpi" width="48" height="48" />
  <icon src="package-assets/android/icon-hdpi.png" density="hdpi" width="72" height="72" />
  <icon src="package-assets/android/icon-xhdpi.png" density="xhdpi" width="96" height="96" />
  <icon src="package-assets/android/icon-xxhdpi.png" density="xxhdpi" width="144" height="144" />
  <icon src="package-assets/android/icon-xxxhdpi.png" density="xxxhdpi" width="192" height="192" />
  <splash src="package-assets/android/splash-320x426.9.png" density="ldpi" orientation="portrait" />
  <splash src="package-assets/android/splash-320x470.9.png" density="mdpi" orientation="portrait" />
  <splash src="package-assets/android/splash-480x640.9.png" density="hdpi" orientation="portrait" />
  <splash src="package-assets/android/splash-720x960.9.png" density="xhdpi" orientation="portrait" />
</platform>


Jumbo Frames in Open vSwitch* with DPDK


This article describes the concept of jumbo frames and how support for that feature is implemented in Open vSwitch* with the Data Plane Development Kit (OvS-DPDK). It outlines how to configure jumbo frame support for DPDK-enabled ports on an OvS bridge and also provides insight into how OvS-DPDK memory management for jumbo frames works. Finally, it details two tests that demonstrate jumbo frames in action on an OvS-DPDK deployment, and looks at another that demonstrates performance gains achieved through the use of jumbo frames. This guide was written for general OvS users who want to know more about the jumbo frame feature and apply it in their OvS-DPDK deployment.

At the time of this writing, jumbo frame support for OvS-DPDK is available on the OvS master branch, and also the 2.6 branch. Installation steps for OvS with DPDK can be found here.

Jumbo Frames

A jumbo frame is distinguished from a “standard” frame by its size: any frame larger than the standard Ethernet MTU (Maximum Transmission Unit) of 1500B is characterized as a jumbo frame. The MTU is the largest amount of data that a network interface can send in a single unit. If a network interface wants to transmit a block of data larger than the MTU, it must fragment the data into multiple units of size MTU, each unit containing part of the data plus the required network-layer encapsulation headers. If, instead, the network devices take advantage of jumbo frames, a significantly larger amount of application data can be carried in a single frame, eliminating much of the overhead incurred by duplication of encapsulation headers.

Thus, the primary benefit of using jumbo frames is the improved data-to-overhead ratio that they provide—the same amount of data can be communicated with significantly less overhead. As a corollary, the resultant reduced packet count also means that the kernel needs to handle fewer interrupts, which reduces the CPU load (N/A in DPDK).
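As a rough worked example, consider moving 9000B of application data, counting only the 18B of Layer 2 header and CRC per frame (and ignoring preamble, inter-frame gap, and L3/L4 headers): with a 1500B MTU the data must be split across six frames, incurring 6 x 18B = 108B of Ethernet overhead plus six copies of the network-layer headers, whereas with a 9000B MTU a single frame carries the same data with just 18B of Ethernet overhead and one set of headers.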

Usage of Jumbo Frames

Jumbo frames are typically beneficial in environments in which large amounts of data need to be transferred, such as Storage Area Networks (SANs), where they improve transfer rates for large files. Many SANs use the Fibre Channel over Ethernet (FCoE) protocol to consolidate their storage and network traffic on a single network; FCoE frames have a minimum payload size of 2112B, so jumbo frames are crucial if fragmentation is to be avoided. Jumbo frames are also useful in overlay networks, where the amount of data that a frame can carry is reduced below the standard Ethernet MTU, as a result of the addition of tunneling headers; boosting the MTU can negate the effects of the additional encapsulation overhead.

Jumbo Frames in OVS

Network devices (netdevs) generally don’t support jumbo frames by default but can be easily configured to do so. Jurisdiction over the MTU of traditional logical network devices is typically beyond the remit of OvS and is instead governed by the kernel’s network stack. A netdev’s MTU can be queried and modified using standard network management tools, such as ifconfig in Linux*. Figure 1 illustrates how ifconfig may be used to increase the MTU of network device p3p3 from 1500 to 9000. The MTU of kernel-governed netdevs is subsequently honored by OVS when those devices are added to an OvS bridge. 


Figure 1: Configuring the MTU of a network device using ifconfig.
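For reference, the figure corresponds to commands like the following (the interface name p3p3 matches the figure; the ip utility is a modern alternative to ifconfig):

$ ifconfig p3p3 mtu 9000
$ ip link set p3p3 mtu 9000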

OvS-DPDK devices cannot avail of ifconfig, however, as control of DPDK-enabled netdevs is maintained by DPDK poll mode drivers (PMDs) and not standard kernel drivers. The OvS-DPDK jumbo frames feature provides a mechanism which OvS employs to modify the MTU of OvS-DPDK netdevs, thus increasing their maximum supported frame size.

Jumbo Frames in OvS-DPDK

This section provides an overview of how frames are represented in both OvS and DPDK, and how DPDK manages packet buffer memory. It then describes how support for jumbo frames is actually implemented in OvS-DPDK.

In OvS, frames are represented in the OvS datapath (dpif) layer as dp_packets (datapath packets), as illustrated in Figure 2. A dp_packet contains a reference to the packet buffer itself, as well as some additional metadata and offsets that OvS uses to process the frame as it traverses the vSwitch.


Figure 2: Simplified view of Open vSwitch* datapath packet buffer.

In DPDK, a frame is represented by the message buffer data structure (rte_mbuf, or just mbuf for short), as illustrated in Figure 3. An mbuf contains metadata which DPDK uses to process the frame, and a pointer to the message buffer itself, which is stored in contiguous memory just after the mbuf. The mbuf's buf_addr attribute points to the start of the message buffer, but the frame data itself actually begins at an offset of data_off from buf_addr. The additional data_off bytes, typically RTE_PKTMBUF_HEADROOM (128 bytes), are allocated in case additional headers need to be prepended to the packet during processing.


Figure 3: Data Plane Development Kit message buffer (‘mbuf’).

Unsurprisingly then, in OvS-DPDK, a frame is represented by a dp_packet, which contains an rte_mbuf. The resultant packet buffer memory layout is shown in Figure 4.


Figure 4: Open vSwitch Data Plane Development Kit packet buffer.

DPDK is targeted for optimized packet processing applications; for such applications, allocation of packet buffer memory from the heap at runtime is much too slow. Instead, DPDK allocates application memory upfront during initialization. To do this, it creates one or more memory pools (mempools) that DPDK processes can subsequently use to create mbufs at runtime with minimum overhead. Mempools are created with the DPDK rte_mempool_create function.

struct rte_mempool *
rte_mempool_create(const char *name, unsigned n, unsigned elt_size,
unsigned cache_size, unsigned private_data_size,
rte_mempool_ctor_t *mp_init, void *mp_init_arg,
rte_mempool_obj_cb_t *obj_init, void *obj_init_arg,
int socket_id, unsigned flags)

The function returns a reference to a mempool containing a fixed number of elements; all elements within the mempool are the same size. The number of elements and their size are determined by the respective values of the n and elt_size parameters provided to rte_mempool_create (the cache_size parameter, by contrast, sets the size of the per-core object cache).

In the case of OvS-DPDK, elt_size needs to be big enough to store all of the data shown in Figure 4: the dp_packet (and the mbuf that it contains), the L2 header and CRC, the IP payload, and the mbuf headroom (and tailroom, if required). By default, the value of elt_size is only large enough to accommodate standard-sized frames (i.e., 1518B or less); however, if it were possible to specify a much larger value, it would allow OvS-DPDK to support jumbo frames in a single mbuf segment.

In OvS, a subset of a net device’s properties can be modified on the command line using the ovs-vsctl utility; OvS 2.6 introduces a new Interface attribute, mtu_request, which users can leverage to adjust the MTU of DPDK devices. For example, to add a physical DPDK port (termed dpdk port in OvS-DPDK) with a Layer 3 MTU of 9000B to OvS bridge br0:

ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk -- set Interface dpdk0 mtu_request=9000

Alternatively, to reduce the MTU of the same port to 6000 after it has been added to the bridge:

ovs-vsctl -- set Interface dpdk0 mtu_request=6000

Note that mtu_request refers to the Layer 3 MTU; OvS-DPDK allows an additional 18B for Layer 2 header and CRC, so the maximum permitted frame size in the above examples is 9018B and 6018B, respectively. Additionally, ports that use the same MTU share the same mempool; if a port has a different MTU than existing ports, OvS creates an additional mempool for it (assuming that there is sufficient memory to do so). Mempools for MTUs that are no longer used are freed.

Functional Test Configuration

This section outlines two functional tests that demonstrate jumbo frame support across OvS-DPDK physical and guest (dpdkvhostuser) ports. The first test simply demonstrates support for jumbo frames across disparate DPDK port types, while the second additionally shows the effects of dynamically altering a port’s MTU at runtime. Both tests utilize a “hairpin” traffic path, as illustrated in Figure 5. During testing, validation of jumbo frame traffic integrity occurs in two places: (1) in the guest’s network stack via tcpdump, and (2) on the traffic generator’s RX interface, via packet capture and inspection. 


Figure 5: Jumbo frame test configuration.

Test Environment

The device under test (DUT) used during jumbo frame testing is configured as per Table 1. Where applicable, each software component is listed with its corresponding commit ID or tag.


Table 1: DUT jumbo frame test environment.

Traffic Configuration

Dummy TCP traffic for both tests is produced by a physical generator; salient traffic attributes are outlined below in Table 2.


Table 2: Jumbo frame test traffic configuration.

9018B frames are used during testing. Note the IP packet size of 9000B and the data size of 8960B, as described in Figure 6; these values will be important later.


Figure 6: Jumbo frame test traffic breakdown.

NIC Configuration

No specific NIC configuration is necessary in order to support jumbo frames, as the DPDK PMD configures the NIC to accept oversized frames as per the user-supplied MTU (mtu_request). The only limitation is that the user-supplied MTU must not exceed the maximum frame size that the hardware itself supports; consult your NIC datasheet for details. At the time of writing, the maximum frame size supported by the Intel® Ethernet Controller XL710 network adapter is 9728B [1], which yields a maximum mtu_request value of 9710.

vSwitch Configuration

Compile DPDK and OvS, mount hugepages, and start up the switch as normal, ensuring that the dpdk-init, dpdk-lcore-mask, and dpdk-socket-mem parameters are set. Note that in order to accommodate jumbo frames at the upper end of the size spectrum, ovs-vswitchd may need additional memory; in this test, 4 GB of hugepages are used.

ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
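The remaining DPDK parameters mentioned above can be set in the same way; the lcore mask value here is illustrative and should match your platform's core layout.

ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-lcore-mask=0x2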

Create an OvS bridge of datapath_type netdev, then add two DPDK phy ports and two guest ports. When adding the ports, specify the mtu_request parameter as 9000; this allows frames up to a maximum of 9018B to be supported. Incidentally, the value of mtu_request may be modified dynamically at runtime, as we'll observe later in Test Case #2.

ovs-vsctl add-br br0 -- set Bridge br0 datapath_type=netdev
ovs-vsctl --no-wait set Open_vSwitch . other_config:pmd-cpu-mask=6
ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk -- set Interface dpdk0 mtu_request=9000
ovs-vsctl add-port br0 dpdk1 -- set Interface dpdk1 type=dpdk -- set Interface dpdk1 mtu_request=9000
ovs-vsctl add-port br0 dpdkvhostuser0 -- set Interface dpdkvhostuser0 type=dpdkvhostuser -- set Interface dpdkvhostuser0 mtu_request=9000
ovs-vsctl add-port br0 dpdkvhostuser1 -- set Interface dpdkvhostuser1 type=dpdkvhostuser -- set Interface dpdkvhostuser1 mtu_request=9000

Inspect the bridge to ensure that MTU has been set appropriately for all ports. Note that all of the ports listed in Figure 7 display an MTU of 9000.

ovs-appctl dpctl/show


Figure 7: Open vSwitch* ports configured with 9000B MTU.

Alternatively, inspect the MTU of each port in turn.

ovs-vsctl get Interface [dpdk0|dpdk1|dpdkvhostuser0|dpdkvhostuser1] mtu

Sample output for this command is displayed in Figure 8.


Figure 8: 9000B MTU for port 'dpdkvhostuser0'.

Start the Guest

sudo -E $QEMU_DIR/x86_64-softmmu/qemu-system-x86_64 -name us-vhost-vm1 -cpu host -enable-kvm \
-m $MEM -object memory-backend-file,id=mem,size=$MEM,mem-path=$HUGE_DIR,share=on -numa node,memdev=mem -mem-prealloc -smp 2 -drive file=/$VM1 \
-chardev socket,id=char0,path=$SOCK_DIR/dpdkvhostuser0 \
-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on \
-chardev socket,id=char1,path=$SOCK_DIR/dpdkvhostuser1 \
-netdev type=vhost-user,id=mynet2,chardev=char1,vhostforce -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2,mrg_rxbuf=on \
--nographic -vnc :1

Guest Configuration

Jumbo frames require end-to-end configuration, so we'll need to set the MTU of the relevant guest network devices to 9000 to avoid fragmentation of jumbo frames in the VM's network stack.

ifconfig eth1 mtu 9000
ifconfig eth2 mtu 9000

Configure IP addresses for the network devices, and then bring them up.

ifconfig eth1 5.5.5.2/24 up
ifconfig eth2 7.7.7.2/24 up

Enable IP forwarding; traffic destined for the 7.7.7.0/24 network will be returned to the vSwitch via the guest’s network stack.

sysctl net.ipv4.ip_forward=1

Depending on the traffic generator setup, a static ARP entry for the traffic destination IP address may be required:

arp -s 7.7.7.3 00:00:de:ad:be:ef

Test Case #1

This test simply demonstrates the jumbo frame feature on OvS-DPDK for dpdk and dpdkvhostuser port types.

Initial setup is as previously described. Simply start continuous traffic to begin the test.

In the guest, turn on tcpdump for the relevant network devices while traffic is live. The output from the tool confirms the presence of jumbo frames in the guest’s network stack. In the sample command lines below, tcpdump output is limited to 20 frames on each port to prevent excessive log output.

tcpdump -i eth1 -v -c 20 # view ingress traffic
tcpdump -i eth2 -v -c 20 # view egress traffic

The output of tcpdump is demonstrated in Figure 9: tcpdump of guest network interfaces. It shows that the length of the IP packets received and subsequently transmitted by the guest is 9000B (circled in blue) and the length of the corresponding data in the TCP segment is 8960B (circled in green). Note that these figures match the traffic profile described in Figure 6: Jumbo frame test traffic breakdown.


Figure 9: tcpdump of guest network interfaces, demonstrating ingress/egress 9000B IP packets containing 8960B of data.

Figure 10 shows the contents of a packet captured at the test endpoint, the traffic generator's RX port. Note that the Ethernet frame length is 9018B, as expected (circled in orange). Additionally, the IP packet length and data length remain 9000B and 8960B, respectively. Since these values remain unchanged for frames that traverse the vSwitch and pass through a guest, we can conclude that the 9018B frames sent by the generator were not fragmented, thus demonstrating jumbo frame support for OvS-DPDK dpdk and dpdkvhostuser ports.


Figure 10: Packet capture at traffic generator Rx endpoint, demonstrating receipt of 9000B IP packets, containing 8960B of data.

Test Case #2

This test demonstrates runtime modification of a DPDK-based netdev’s MTU, using the ovs-vsctl mtu_request parameter.

Setup is identical to the previous test case; to kick off the test, just start traffic (9018B frames, as per Table 2: Jumbo frame test traffic configuration) on the generator’s Tx interface.

Observe that 9k frames are supported throughout the entire traffic path, as per Test Case #1.

Now reduce the MTU of one of the dpdk (that is, phy) ports to 6000. This configures the NIC's Rx port to accept frames with a maximum size of 6018B [2].

ovs-vsctl set Interface dpdk0 mtu_request=6000

Verify that the MTU was set correctly for dpdk0 and that the MTU of the remaining ports remains unchanged, as per Figure 11.

ovs-appctl dpctl/show


Figure 11: 6000B MTU for port ‘dpdk0’.

Observe that traffic is no longer received by the vSwitch, as it is dropped by the NIC due to its size, as per Figure 12. The empty flow table confirms that the datapath is not currently handling any traffic.

ovs-appctl dpctl/dump-flows


Figure 12: Empty set of flows processed by OvS userspace datapath.

Running tcpdump in the guest provides additional confirmation that packets are not reaching the guest.

Next, reduce the traffic frame size to 6018B on the generator; this frame size is permitted by the NIC's configuration, as per the previously supplied value of mtu_request. Observe that these frames now pass through to the guest; as expected, the IP packet size is 6000B, and the TCP segment contains 5960B of data (Figure 13).


Figure 13: tcpdump of guest network interfaces, demonstrating ingress/egress 6000B IP packets containing 5960B of data.

Examining traffic captured at the test endpoint confirms that 6018B frames were received, with IP packet and data lengths as expected (Figure 14).


Figure 14: Packet capture at traffic generator Rx endpoint, demonstrating receipt of 6000B IP packets, containing 5960B of data.

Performance Test Configuration

This section demonstrates the performance benefits of jumbo frames in OVS-DPDK. In the described sample test, two VMs are spawned on the same host, and traffic is transmitted between them. One VM runs an iperf3 server, while the other runs an iperf3 client. iperf3 initiates a TCP connection between the client and server, and transfers large blocks of TCP data between them. Test setup is illustrated in Figure 15.


Figure 15: VM-VM jumbo frame test setup.

Test Environment

The host environment is as described previously, in the “Functional Test Configuration” section.

The guest environment is as described below, in Figure 16.


Figure 16: Jumbo frame test guest environment

vSwitch Configuration

Start OVS, ensuring that the relevant OVSDB DPDK fields are set appropriately.

sudo -E $OVS_DIR/utilities/ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
sudo -E $OVS_DIR/utilities/ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
sudo -E $OVS_DIR/utilities/ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
sudo -E $OVS_DIR/vswitchd/ovs-vswitchd unix:$DB_SOCK --pidfile --detach --log-file &

Create an OVS bridge, and add two dpdkvhostuser ports.

sudo -E $OVS_DIR/utilities/ovs-vsctl --timeout 10 --may-exist add-br br0 -- set Bridge br0 datapath_type=netdev -- br-set-external-id br0 bridge-id br0 -- set bridge br0 fail-mode=standalone
sudo -E $OVS_DIR/utilities/ovs-vsctl --timeout 10 set Open_vSwitch . other_config:pmd-cpu-mask=6
sudo -E $OVS_DIR/utilities/ovs-vsctl --timeout 10 add-port br0 $PORT0_NAME -- set Interface $PORT0_NAME type=dpdkvhostuser
sudo -E $OVS_DIR/utilities/ovs-vsctl --timeout 10 add-port br0 $PORT1_NAME -- set Interface $PORT1_NAME type=dpdkvhostuser

Start the guests, ensuring that mergeable buffers are enabled.

VM1

sudo -E taskset 0x60 $QEMU_DIR/x86_64-softmmu/qemu-system-x86_64 -name us-vhost-vm1 -cpu host -enable-kvm -m 4096M -object memory-backend-file,id=mem,size=4096M,mem-path=$HUGE_DIR,share=on -numa node,memdev=mem -mem-prealloc -smp 2 -drive file=$VM1 -chardev socket,id=char0,path=$SOCK_DIR/dpdkvhostuser0 -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on,csum=off,gso=off,guest_csum=off,guest_tso4=off,guest_tso6=off,guest_ecn=off --nographic -vnc :1

VM2

sudo -E taskset 0x180 $QEMU_DIR/x86_64-softmmu/qemu-system-x86_64 -name us-vhost-vm2 -cpu host -enable-kvm -m 4096M -object memory-backend-file,id=mem,size=4096M,mem-path=$HUGE_DIR,share=on -numa node,memdev=mem -mem-prealloc -smp 2 -drive file=$VM2 -chardev socket,id=char1,path=$SOCK_DIR/dpdkvhostuser1 -netdev type=vhost-user,id=mynet2,chardev=char1,vhostforce -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2,mrg_rxbuf=on,csum=off,gso=off,guest_csum=off,guest_tso4=off,guest_tso6=off,guest_ecn=off --nographic -vnc :2

Guest Configuration

Set an IP address for, and bring up, the virtio network device in each guest.

VM1

ifconfig eth1 5.5.5.1/24 up

VM2

ifconfig eth2 5.5.5.2/24 up

Establish Performance Baseline

Start an iperf3 server on VM2.

iperf3 -s

Start an iperf3 client on VM1 and point it to the iperf3 server on VM2.

iperf3 -c 5.5.5.2

Observe the performance of both server and client. Figure 17 demonstrates an average transfer rate of 6.98 Gbps between client and server, which serves as our performance baseline.


Figure 17: Guest iperf3 transfer rates, using standard Ethernet MTU.
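For more stable numbers, the transfer can be run for a fixed duration with periodic reports using iperf3's standard options; the duration and interval values below are illustrative.

iperf3 -c 5.5.5.2 -t 60 -i 10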

Measure Performance with Jumbo Frames

Note: This test can be performed immediately after the previous one; it is not necessary to tear down the existing setup.

Additional Host Configuration

Increase the MTU for the dpdkvhostuser ports to 9710B (max supported mtu_request).

ovs-vsctl set Interface dpdkvhostuser0 mtu_request=9710
ovs-vsctl set Interface dpdkvhostuser1 mtu_request=9710

Check the bridge to verify that the MTU for each port has increased to 9710B, as per Figure 18.

ovs-appctl dpctl/show


Figure 18: dpdkvhostuser ports with 9710B MTU.

Additional Guest Configuration

In each VM, increase the MTU of the relevant network interface to 9710B, as per Figure 19 and Figure 20.

ifconfig eth1 mtu 9710
ifconfig eth1 | grep mtu


Figure 19: Set 9710B MTU for eth1 on VM1 with ifconfig.

ifconfig eth2 mtu 9710
ifconfig eth2 | grep mtu


Figure 20: Set 9710B MTU for eth2 on VM2 with ifconfig.

Start the iperf3 server in VM2 and kick off the client in VM1, as before. Observe that throughput has more than doubled, from the initial rate of ~7 Gbps to 15.6 Gbps (Figure 21).


Figure 21: Guest iperf3 transfer rates using 9710B MTU.

Conclusion

In this article, we have described the concept of jumbo frames and observed how they may be enabled at runtime for DPDK-enabled ports in OvS. We’ve also seen how packet buffer memory is organized in OVS-DPDK and learned how to set up and test OVS-DPDK jumbo frame support. Finally, we’ve observed how enabling jumbo frames in OVS-DPDK can dramatically improve throughput for specific use cases.

About the Author

Mark Kavanagh is a network software engineer with Intel. His work is primarily focused on accelerated software switching solutions in user space running on Intel® architecture. His contributions to Open vSwitch with DPDK include incremental DPDK version enablement, Jumbo Frame support [3], and a TCP Segmentation Offload (TSO) RFC [4].

References

  1. http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xl710-10-40-controller-datasheet.pdf, p.72
  2. 6000B IP packet + 14B L2 header + 4B L2 CRC
  3. http://openvswitch.org/pipermail/dev/2016-August/077585.html
  4. http://openvswitch.org/pipermail/dev/2016-June/072871.html

Bringing Up Arduino 101* (branded Genuino 101* outside the U.S.) on Ubuntu* under VMware*


Introduction

The Arduino 101* (branded Genuino 101* outside the U.S.) is a learning and development platform that uses a low-power Intel® Curie™ module powered by the Intel® Quark™ SE microcontroller. The Intel® Quark™ SE microcontroller contains a single-core 32 MHz x86 (Intel® Quark™ processor core) and a 32 MHz Argonaut RISC Core (ARC)* EM processor. The Arduino 101 platform runs on Windows*, Mac OS X*, and Linux* operating systems. This guide demonstrates how to run the Arduino 101 platform on Ubuntu* using VMware* Workstation, a virtualization application that allows you to run other operating systems and their applications from your desktop.

Hardware components

The hardware components used in this project are listed below:

Setting up VMware* workstation on Ubuntu*

Go to the VMware website to download and install the latest VMware workstation player for Windows. Then go to the Ubuntu* website and download the latest version of Ubuntu Desktop.

Open VMware and create a new virtual machine using the downloaded Ubuntu image.

Development board download

Visit https://www.arduino.cc/en/Main/Software to download the Arduino Software IDE version 1.6.7 or later for Linux. As of this writing, the latest Linux Arduino IDE version supported by Arduino 101 is arduino-1.6.11-linux64.tar.xz.

Copy arduino-1.6.11-linux64.tar.xz to the Ubuntu folder in the VMWare environment.

Set up the environment for Arduino 101*

Untar arduino-1.6.11-linux64.tar.xz and install the Arduino IDE software.

sudo apt-get update
tar -xvf arduino-1.6.11-linux64.tar.xz
sudo mv arduino-1.6.11 /opt
cd /opt/arduino-1.6.11
./install.sh
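On Ubuntu, access to the board's serial device (typically /dev/ttyACM0, though the name may differ on your system) requires membership in the dialout group. If the IDE cannot open the serial port later on, add your user to that group, then log out and back in:

sudo usermod -a -G dialout $USER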

Bring up Arduino on Ubuntu*

1. Connect the Arduino 101 platform to the Ubuntu virtual machine running in VMware Workstation.

cd /opt/arduino-1.6.11
sudo ./arduino

Figure 1: Bringing up the Arduino IDE* on the Ubuntu* command line

2. Choose Tools > Board > Boards Manager to launch the board manager to install the Intel® Curie board.

Figure 2: Launching the Boards Manager

Figure 3: Installing Intel® Curie boards

3. Choose Tools > Port and select the Arduino 101 port.

Figure 4: Selecting the Arduino 101* port

4. Choose Tools > Board and select the Arduino 101 board.

Figure 5: Selecting the Arduino 101* board

5. Choose File > Examples > Basics > Blink to open the Blink sketch.

Figure 6: Uploading the Blink sketch

The LED on the Arduino 101 platform should now blink.

Figure 7: Arduino 101* with LED Blinking

Arduino 101* Libraries

The Arduino 101* libraries are a collection of code that provides extra functionality for sketches, making it easy to connect to Bluetooth* LE, sensors, and timers. To experiment with the built-in Arduino 101 libraries, visit https://www.arduino.cc/en/Guide/Libraries. The Arduino 101 libraries are based on the open source corelibs; if you are interested in experimenting with the corelibs, visit 01.org's GitHub*, but they are not required to use the Arduino 101 libraries.

Summary

We have described how to launch the Arduino 101 platform on Ubuntu in VMware. Experiment with the Arduino 101 libraries, the Grove* Starter Kit Plus, and more sensors and shields to enjoy the power of the Intel Curie module.

Helpful References

About the author

Nancy Le is a software engineer at Intel Corporation in the Software and Services Group working on Intel® Atom™ processor scale-enabling projects.

Intel® Xeon Phi™ Delivers Competitive Performance For Deep Learning—And Getting Better Fast


Authors:  Dheevatsa Mudigere, Dipankar Das, Vadim Pirogov, Murat Guney, Srinivas Sridharan, and Andres Rodriguez of Intel Corporation

Baidu’s recently announced deep learning benchmark, DeepBench, documents performance for the lowest-level compute and communication primitives for deep learning (DL) applications. The goal is to provide a standard benchmark to evaluate different hardware platforms using the vendor’s DL libraries.

Intel continues to optimize its Intel® Xeon and Intel® Xeon Phi™ processors for DL via the Intel Math Kernel Library (Intel MKL). Intel MKL 2017 includes a collection of performance primitives for DL applications. The library supports the most commonly used primitives necessary to accelerate image recognition topologies and GEMM primitives necessary to accelerate various types of RNNs. The functionality includes convolution, inner product, pooling, normalization and activation primitives, with support for forward (inference) and backward (gradient propagation) operations. The MKL 2017 library is freely available with a community license, and some of these optimizations are also available as part of open source Intel MKL-DNN project.

Intel® Xeon Phi™ processors were used for this benchmark. In this paper, we point out the DL operations where Intel® Xeon Phi™ shines—and is rapidly improving.

DeepBench background

DeepBench aims to include primitives such as GEMMs (General Matrix-Matrix Multiplication), convolution layers, and recurrent layers, with the specific configurations used across different types of networks and applications. The current release is a first attempt and is not yet a complete set; the hope is that, with active participation from the community, it will become a comprehensive benchmark of the primitives used in deep neural networks (DNNs), convolutional neural networks (CNNs), recurrent neural networks (RNNs), and long short-term memory networks (LSTMs) across applications such as image recognition, speech recognition, and natural language processing (NLP).

Further, DeepBench includes cases with varying mini-batch sizes to capture the impact of scaling (data parallelism). DeepBench is primarily intended as a tool for comparing different hardware platforms; the metric Baidu publishes is absolute performance, in TFLOP/s.

In the remainder of this paper, we address the performance and suitability of Intel Xeon Phi processors for various DL operations.

GEMM

The majority of the computations involved in deep learning can be formulated as matrix-matrix multiplication operations, making GEMM a core compute primitive. Typically, the matrices coming from DL applications – fully connected (FC) layers, recurrent neural networks (RNNs), and long short-term memory networks (LSTMs)—are skewed and small, resulting in dense, small and oddly shaped (tall-skinny, fat-short) GEMM operations. This is captured by the GEMM kernel test cases in DeepBench. Intel Math Kernel Library (Intel MKL) 2017 includes optimized GEMM implementations that achieve high performance on both the Intel Xeon processor and Intel Xeon Phi processor for matrices that are typically seen in DL applications, exposed through the packed GEMM application programming interface (API).

Unlike the conventional GEMM operations (large and square matrices), these dense, small, and oddly shaped operations are particularly challenging due to their limited parallelism and highly skewed dimensions. Conventional methods fail to achieve peak performance, as outlined by this Baidu paper on optimizing GEMM performance for RNNs [1].

The specialized packed API implements an optimized block-GEMM operation, with a block formulation to increase reuse of blocks, without additional data rearrangement and fine-grained parallelization with minimal on-demand synchronization to increase the available concurrency for small matrices. These optimizations allow designers to effectively exploit the full-cache hierarchy in both Intel® Xeon and Intel® Xeon Phi™, extracting sufficient parallelism to ensure that all cores are kept busy to achieve significantly improved (near-peak) performance for such typical DL matrices.

The DeepBench GEMM kernel results on the Intel Xeon Phi processor include both the conventional Intel MKL GEMM and the new packed GEMM API. These numbers are measured on the Intel Xeon Phi processor 7250 (codenamed Knights Landing, or KNL) with Intel MKL 2017, which is publicly available. The DeepBench GEMM results (Fig. 1; Nvidia performance measured by Baidu) show that Intel Xeon Phi processor performance is higher than that of the Nvidia* M40 GPU (whose peak FLOPS are comparable to the Intel Xeon Phi processor's) across almost every configuration, and higher than that of the Nvidia* Pascal TitanX across some small and medium (N <= 64) matrices. With the next generation of Intel Xeon Phi processor (codenamed Knights Mill) offering significantly higher raw compute power, we expect to see better performance when it is released next year.

 

Fig. 1  Source: Data from Baidu as of Sept 26, 2016

 

Convolution

The convolution operation is the other primary compute kernel in DL applications. For image-based applications (CNNs), convolution layers contribute the majority of the compute. Increasingly, convolutional layers are also being used in speech applications (as acoustic models) and in NLP.

The convolutional-layer operation consists of a six-level nested loop over output feature maps (K), input feature maps (C), feature map height and width (H, W), and kernel height and width (R, S). Additionally, this operation is performed over all samples of the mini-batch (N). Hence, for typical layer configurations, it can involve significant computation.

Written as nested for loops in naïve order, this operation does not leverage the available data reuse and becomes limited by memory bandwidth, thus failing to utilize all the available compute on the Intel Xeon Phi processor. Intel MKL offers an optimized implementation of the convolution layers using the direct convolution operator, which gets close to the achievable peak performance.

 

The direct convolution kernel includes the following optimizations:

  • Reformulating the convolution operation to better utilize the cache hierarchy. The loop-over output and input feature maps (K, C) are blocked for on-die caches and allow for inner-loop vectorization over output feature maps and independent fused multiply-add computations

  • Data is laid out so that the innermost loop accesses are contiguous, ensuring better utilization of cache lines and, therefore, bandwidth, while also improving prefetcher performance

  • Register blocking to improve reuse of data in register files, decrease on-core cache traffic, and hide the latency of the FMA operations

  • Optimal work partitioning to ensure that all cores are busy, with minimal load imbalance.

More detailed information on the implementation detail can be found under the machine learning chapter of the Intel Xeon Phi processor reference book [2].

These numbers are measured on the Intel Xeon Phi processor 7250 with Intel MKL 2017. For the DeepBench convolution kernels (Fig. 2), we also include results for the open source Intel-optimized convolution layer implementation using libxsmm [3]. The absolute performance for the convolution layers is competitive with the Nvidia M40 (which has comparable FLOPS). However, since the current version of Intel MKL supports only the direct convolution operator, the marked kernels with larger differences are those convolution layers where direct convolution kernels on the Intel Xeon Phi processor are compared against a Winograd-based implementation. Intel MKL does not provide optimized Winograd convolution implementations at this point; incorporating a Winograd-based implementation into Intel MKL is work in progress. Although the Winograd convolution algorithm shows significant speedups on certain convolution shapes and sizes, it does not contribute significantly to full-topology performance.

Fig. 2 Source: Data from Baidu as of Sept 26, 2016

AllReduce

AllReduce is the communications primitive in DeepBench, covering message sizes commonly seen in deep learning networks and applications. The benchmark measures MPI_AllReduce latencies for five different message sizes (in floats): 100K, 3M, 4M, 6.4M, and 16M, on 2, 4, 8, 16, and 32 nodes. It uses the AllReduce benchmark from the Ohio State University micro-benchmarks suite [4], with minor modifications.
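For reference, a representative invocation of the AllReduce benchmark under Intel MPI is sketched below; the exact flags and message-size arguments depend on the OSU micro-benchmarks version and the modifications made, so treat this as illustrative rather than the exact command used.

mpirun -n 32 -ppn 1 ./osu_allreduce -m 67108864 -f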

We report MPI_AllReduce latencies measured on the Intel Xeon Phi processor 7250 on our internal Endeavor cluster, with Intel® Omni-Path Architecture (Intel® OPA) series 100 fabric in a fat-tree topology, using Intel MPI 5.1.3.181. The competitive data from Baidu was measured on a GPU cluster with 8 NVIDIA TitanX-Maxwell cards per node (with optimized intra-node communications). This is compared with nodes consisting of a single Intel Xeon Phi processor each, so a 32-node Intel Xeon Phi processor measurement below is comparable to 4 GPGPU nodes, each node having a maximum of 8 cards.

The Intel Xeon Phi processor DeepBench results use the stock Intel MPI Library 2017 on the above-mentioned cluster (Fig. 3). Latencies are better for 8 GPUs (in a single node) than for 8 Intel Xeon Phi processor-based nodes, since the former constitutes within-node (peer-to-peer) communication. For communication across nodes, however, the Intel Xeon Phi processor AllReduce latencies are significantly better: latencies for 16 Intel Xeon Phi processor-based nodes were better than those for 2 GPU (x8) nodes across most message sizes. Fig. 3 is normalized against the TitanX-Maxwell results, which Baidu measured as the best performing.

Fig. 3 Source: Data from Baidu as of Sept 26, 2016

Further, we present results using our optimized communication library (also presented in Pradeep Dubey's IDF16 technical session), which improves AllReduce latencies by an average of 3.5X across the message sizes and node counts of interest (Fig. 4). This benchmark captures the latency of only a single MPI_AllReduce operation; in any application context we can typically expect multiple such operations in flight, in which case we can expect further performance improvements.

Fig. 4 Source: Intel internal measurements, September 2016 [5]

 

Recurrent Layers – RNN/LSTM

DeepBench also includes recurrent layers (vanilla RNN and LSTM layers), primarily based on the DeepSpeech2 model configurations and with different mini-batch sizes to capture the impact of scaling. The core compute kernel for these recurrent layers is still the GEMM operation, and the matrix sizes corresponding to these layers are already captured in the GEMM benchmark. For these cases, we can see from Fig. 1 that the Intel Xeon Phi processor consistently performs better than current Nvidia Maxwell GPUs and, in many cases, better than the Nvidia Pascal GPU, which has almost 2X higher peak FLOPS. However, these layers are included as independent primitives to showcase RNN/LSTM-specific optimizations. For the current benchmark release, we do not include Intel Xeon Phi processor results for these cases. We are working on an optimized implementation of the RNN layers that exploits their specific usage pattern to more efficiently leverage the available caches; the Intel Xeon Phi processor results will be added once RNN layer support is introduced to Intel MKL.

While these benchmark results are a snapshot in time, Intel continues to invest in software optimizations that will further improve the performance of the Intel Xeon and Intel Xeon Phi processor families, especially on the convolution benchmark. Intel will continue to update the results to ensure that end customers have a choice of silicon for deep learning workloads.

Authors

Dheevatsa Mudigere, Dipankar Das, Vadim Pirogov, Murat Guney, Srinivas Sridharan, and Andres Rodriguez, Intel Corporation

 

[1] http://svail.github.io/rnn_perf/

[2] http://lotsofcores.com/KNLbook

[3] https://github.com/hfp/libxsmm

[4] http://mvapich.cse.ohio-state.edu/benchmarks/

[5] FTC Disclaimer:  Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. 
 

Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.   For more complete information visit: www.intel.com/benchmarks.  

Configuration: Intel® Xeon Phi™ Processor 7250 (68 Cores, 1.4 GHz, 16GB MCDRAM), 128 GB DDR4-2400 MHz, Intel® Omni-Path Host Fabric Interface Adapter 100 Series 1 Port, Red Hat* Enterprise Linux 6.7, Intel® ICC version 16.0.3, Intel® MPI Library 5.1.3 for Linux, Intel® Optimized DNN Framework

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.  Notice Revision #20110804

 

Open vSwitch* with DPDK Overview


This article presents a high-level overview of Open vSwitch* with the Data Plane Development Kit (OvS-DPDK)—the high performance, open source virtual switch—and links to further technical articles that dive deeper into individual OvS-DPDK features. This article was written for users of OvS who want to know more about DPDK integration.

Note: Users can download a zip file of the OVS master branch or the 2.6 branch, as well as installation steps for the master branch or the 2.6 branch.

OvS-DPDK High-level Architecture

Open vSwitch is a production quality, multilayer virtual switch licensed under the open source Apache* 2.0 license. It supports SDN control semantics via the OpenFlow* protocol and its OVSDB management interface. It is available from openvswitch.org, GitHub*, and is also consumable through Linux* distributions.

Native Open vSwitch generally forwards packets via the kernel space data path (see Figure 1). In the kernel data path, the switching “fastpath” consists of a simple flow table indicating forwarding/action rules for packets that are received. Exception packets (first packet in a flow) do not match any existing entries in the kernel fastpath table and are sent to the user space daemon for processing (slowpath). After user space handles the first packet in the flow, the daemon will then update the flow table in kernel space so that subsequent packets in the flow can be processed in the fastpath and not sent to user space. Following this approach, native OvS can eliminate the costly context switch between kernel and user space for a large percentage of received packets. However, the achievable packet throughput is limited by the forwarding bandwidth of the Linux network stack, which is not suited for use cases requiring a high rate of packet processing; for example, Telco.

DPDK is a set of user space libraries that enable a user to create optimized, performant packet processing applications (information available at DPDK.org). In particular, it offers a series of Poll Mode Drivers (PMDs), which enable the direct transfer of packets between user space and the physical interface, bypassing the kernel network stack. This offers a significant performance boost over kernel forwarding through the elimination of both interrupt handling and traversal of the kernel network stack. By integrating OvS with DPDK, the switching fastpath is in user space, while the exception path remains the same as the one traversed by packets in the kernel fastpath case. The integration of DPDK with OvS is illustrated at a high level in Figure 1.


Figure 1: Integration of Data Plane Development Kit data plane with native Open vSwitch*.

Figure 2 below shows the high-level architecture of OvS-DPDK. OvS switching ports are represented by network devices (or netdevs). Netdev-dpdk is a DPDK-accelerated network device that uses DPDK to accelerate switch I/O, through three separate interfaces: one physical interface (handled by the librte_eth library within DPDK), and two virtual interfaces (librte_vhost and librte_ring). These interface with the physical and virtual devices connected to the virtual switch.

Other OvS architectural layers provide further functionality and interface with, for example, the SDN controller. Dpif-netdev provides user space forwarding and ofproto is the OvS library that implements an OpenFlow switch. It talks to OpenFlow controllers over the network and to switch hardware or software through an ofproto provider. The ovsdb server maintains the up-to-date switching table information for this OvS instance and communicates this to the SDN controller. The following section provides details of the switching/forwarding tables, with further information on the OvS architecture available through the openvswitch.org website.


Figure 2: Open vSwitch* with Data Plane Development Kit high-level architecture.

OvS-DPDK Switching Table Hierarchy

A packet entering OvS-DPDK from a physical or virtual interface receives a unique identifier or hash, based on its header fields, which is then matched against an entry in one of three main switching tables: the exact match cache (EMC), the data path classifier (dpcls), or the ofproto classifier. A packet’s identifier will traverse each of these three tables in order, unless a match is found, in which case the appropriate actions indicated by the match rule in the table will be executed and the packet forwarded out of the switch upon completion of all actions. This scheme is illustrated in Figure 3.


Figure 3: Open vSwitch* with Data Plane Development Kit switching table hierarchy.

The three tables have different characteristics and associated throughput performance/latency. The EMC offers fastest processing for a limited number of table entries. The packet’s identifier must exactly match the entry in this table for all fields—the 5-tuple of source IP and port, destination IP and port, and protocol—for highest speed processing or it will “miss” on the EMC and pass through to the dpcls. The dpcls contains many more table entries (arranged in multiple subtables) and enables wildcard matching of the packet identifier (for example, destination IP and port are specified but any source is allowed). This gives approximately half the throughput performance of the EMC and caters to a much larger number of table entries. Packet flows matched in the dpcls are installed in the EMC so that subsequent packets with the same identifier can be processed at the highest speed.

A miss on the dpcls results in the packet identifier being sent to the ofproto classifier so that the OpenFlow controller can decide on the action. This path is the least performant, >10x slower than the EMC. Matches in the ofproto classifier result in new table entries being established in the faster switching tables so that subsequent packets in the same flow can be processed more quickly.
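The distribution of lookups across these tables can be observed on a running switch using the PMD statistics command, whose output includes counters for EMC hits, megaflow (dpcls) hits, and misses; exact counter names vary between OvS releases.

ovs-appctl dpif-netdev/pmd-stats-show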

OvS-DPDK Features and Performance

At the time of this writing, the following high-level OvS-DPDK features are available on the OvS master code branch:

  • DPDK support for v16.07 (supported version increments with each new DPDK release)
  • vHost user support
  • vHost reconnect
  • vHost multiqueue
  • Native tunneling support: VxLAN, GRE, Geneve
  • VLAN support
  • MPLS support
  • Ingress/egress QoS policing
  • Jumbo frame support
  • Connection tracking
  • Statistics: DPDK vHost and extended DPDK stats
  • Debug: DPDK pdump support
  • Link bonding
  • Link status
  • VFIO support
  • ODL/OpenStack detection of DPDK ports
  • vHost user NUMA awareness

A recent performance comparison between native OvS and OvS-DPDK is highlighted in Figure 4. This shows the throughput in packets-per-second for the Phy-OvS-Phy use case, indicating a ~10x performance enhancement for OvS-DPDK over native OvS, increasing to ~12x with Intel® Hyper-Threading Technology (Intel® HT Technology) enabled (labelled 1C2T, or one physical core with two logical threads, in the figure legend). Similarly, the Phy-OvS-VM-OvS-Phy use case demonstrates a ~9x performance enhancement for OvS-DPDK over native OvS.


Figure 4: Performance comparison - native Open vSwitch* (OvS) and OvS with Data Plane Development Kit.

The hardware and software configuration for this data, along with further use case results, can be found in the Intel® Open Network Platform (Intel® ONP) performance report.

OvS-DPDK Availability

OvS-DPDK is available in the upstream openvswitch.org repository and is also available through Linux distributions as below. The latest milestone release is OvS 2.6 (September 2016), and releases are made with a six-month cadence.

Code is available for download as follows: OvS master branch; OvS 2.6 release branch. Installation steps for the master branch are available as well as installation steps for the 2.6 release branch.

Packaged versions of OvS with DPDK are available from:

Red Hat* OpenStack Platform

Ubuntu*

Mirantis* OpenStack

Open Platform for NFV*

Additional Information

To learn more about OvS-DPDK, check out the following videos and articles on Intel® Developer Zone, 01.org, Intel® Network Builders and Intel® Network Builders University.

User guides:

Developer guides:

Articles:

OvS with DPDK milestone release webinars:

INB university:

White paper:

Have a question? Feel free to follow up with the query on the Open vSwitch discussion mailing thread.

About the Author

Robin Giller is a program manager with the Intel Network Platforms Group.

Getting Started with Intel® Software Optimization for Theano* and Intel® Distribution for Python*


Contents

Summary

Theano is a Python* library developed at the LISA lab to define, optimize, and evaluate mathematical expressions, including ones with multi-dimensional arrays (numpy.ndarray). Intel® optimized-Theano is a new version based on Theano 0.8.0rc1, optimized for Intel® architecture and enabling Intel® Math Kernel Library (Intel® MKL) 2017. The latest version of Intel MKL includes optimizations for the Intel® Advanced Vector Extensions 2 (Intel® AVX2) and Intel® AVX-512 instruction sets, which are supported by Intel® Xeon® and Intel® Xeon Phi™ processors.

Theano can be installed and used with several combinations of development tools and libraries on a variety of platforms. This tutorial provides one such recipe, describing the steps to build and install Intel optimized-Theano with the Intel compilers and Intel MKL 2017 on CentOS*- and Ubuntu*-based systems. We also verify the installation by running common industry-standard benchmarks such as MNIST*, DBN-Kyoto*, LSTM*, and ImageNet*.

Prerequisites

Intel® Compilers and Intel® Math Kernel Library 2017

This tutorial assumes that the Intel compilers (C/C++ and Fortran) are already installed and verified. If not, the Intel compilers can be downloaded and installed as part of Intel® Parallel Studio XE, or installed independently.

Installing Intel MKL 2017 is optional when using the Intel® Distribution for Python*. For other Python distributions, Intel MKL 2017 can be downloaded as part of Intel Parallel Studio XE 2017, or downloaded and installed for free under the community license. To do so, first register here for a free community license and then follow the installation instructions.

Python* Tools

In this tutorial, the Intel® Distribution for Python* is used, as it provides ready access to tools and techniques that are enabled and verified for higher performance on Intel architecture. This allows the use of Intel-optimized, precompiled tools like NumPy* and SciPy* without having to build and install them.

The Intel Distribution for Python is available as part of Intel Parallel Studio XE, or it can be independently downloaded for free from here.

Instructions to install the Intel Distribution for Python are given below. This article assumes that the Python installation is completed in the local user account.

Python 2.7
tar -xvzf l_python27_p_2017.0.028.tgz
cd l_python27_p_2017.0.028
./install.sh

Python 3.5
tar -xvzf l_python35_p_2017.0.028.tgz
cd l_python35_p_2017.0.028
./install.sh

Using Anaconda, create an independent user environment with the steps given below. Here, the required NumPy, SciPy, and Cython packages are also installed as part of creating the environment.

Python 2.7
conda create -n pcs_theano_2 -c intel python=2 numpy scipy cython
source activate pcs_theano_2

Python 3.5
conda create -n pcs_theano_2 -c intel python=3 numpy scipy cython
source activate pcs_theano_2

Alternatively, NumPy and SciPy can be built and installed from source, as given in Appendix A. Appendix A also shows steps to install other Python development tools, which may be required in case a non-Intel distribution of Python is used.

 

Building and installing Intel® Software Optimization for Theano*

The branch of Theano optimized for Intel architecture can be checked out and installed from the following Git repository:

git clone https://github.com/intel/theano.git theano
cd theano
python setup.py build
python setup.py install
theano-cache clear

An example Theano configuration file is given below for reference. In order to use the Intel compilers and specify the compiler flags to be used with Theano, create a copy of this file in the user's home directory.

vi ~/.theanorc

[cuda]
root = /usr/local/cuda

[global]
device = cpu
floatX = float32
cxx = icpc
mode = FAST_RUN
openmp = True
openmp_elemwise_minsize = 10
[gcc]
cxxflags = -qopenmp -march=native -O3 -vec-report3 -fno-alias -opt-prefetch=2 -fp-trap=none
[blas]
ldflags = -lmkl_rt

 

Verify Theano and NumPy Installation

It is important to verify which versions of the Theano and NumPy libraries are referenced once they are imported in Python. The versions of NumPy and Theano referenced in this article are verified as follows:

python -c "import numpy; print (numpy.__version__)"
->1.11.1
python -c "import theano; print (theano.__version__)"
-> 0.9.0dev1.dev-*

It is also important to verify that the installed versions of NumPy and Theano are using Intel MKL.

python -c "import theano; print (theano.numpy.show_config())"


Fig 1. Desired output for theano.numpy.show_config()

 

Benchmarks

The DBN-Kyoto and ImageNet benchmarks are available in the theano/democase directory.

DBN-Kyoto

Procuring the Dataset for Running DBN-Kyoto

The sample dataset for DBN-Kyoto can be downloaded from Dropbox via the following link: https://www.dropbox.com/s/ocjgzonmxpmerry/dataset1.pkl.7z?dl=0. Unzip the file and save it in the theano/democase/DBN-Kyoto directory.

Prerequisites

Dependencies for training DBN-Kyoto can be installed using Anaconda or built from the sources provided in the tools directory. Due to conflicts between the pandas library and Python 3, this benchmark is validated only for Python 2.7.

Python 2.7
conda install -c intel --override-channels pandas
conda install imaging

Alternatively the dependencies can also be installed from source as given in Appendix B.

Running DBN-Kyoto on CPU

The provided run.sh script can be used to download the dataset (if not already present) and start the training.

cd theano/democase/DBN-Kyoto/
./run.sh

 

MNIST

In this article, we show how to train a neural network on MNIST using Lasagne, which is a lightweight library to build and train neural networks in Theano. The Lasagne library will be built and installed using Intel compilers.

Download the MNIST Database

The MNIST database can be downloaded from http://yann.lecun.com/exdb/mnist/. We downloaded images and labels for both training and validation data. 

Installing Lasagne Library

The latest version of the Lasagne library can be built and installed from the Lasagne git repository as given below:

Python 2.7 and Python 3.5
git clone https://github.com/Lasagne/Lasagne.git
cd Lasagne
python setup.py build
python setup.py install

Training

cd Lasagne/examples
python mnist.py [model [epochs]]
                    --  where model can be mlp - simple multi layer perceptron (default) or
                         cnn - simple convolution neural network.
                         and epochs = 500 (default)
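For example, to train the convolutional variant for 100 epochs:

python mnist.py cnn 100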

 

AlexNet

Procuring the ImageNet dataset for AlexNet training

The ImageNet dataset can be obtained from the image-net.org website.

Prerequisites

Dependencies for training AlexNet can be installed using Anaconda or from the Fedora EPEL source repository. Currently, Hickle (a required dependency for preprocessing the data) is only available for Python 2; it is not supported on Python 3.

  • Installing h5py, pyyaml, pyzmq using Anaconda:
conda install h5py
conda install -c intel --override-channels pyyaml pyzmq
  • Installing Hickle (HDF5-based clone of Pickle):
git clone https://github.com/telegraphic/hickle.git
cd hickle
python setup.py build
python setup.py install

Alternatively, the dependencies can also be installed using the source as given in appendix B.

Preprocessing the ImageNet Dataset

Preprocessing is required to dump Hickle files and create labels for training and validation data.

  • Modify the paths.yaml file in the preprocessing directory to update the path for the dataset. One example of paths.yaml file is given below for reference.
cat theano/democase/alexnet_grp1/preprocessing/paths.yaml

train_img_dir: '/mnt/DATA2/TEST/ILSVRC2012_img_train/'
# the dir that contains folders like n01440764, n01443537, ...

val_img_dir: '/mnt/DATA2/TEST/ILSVRC2012_img_val/'
# the dir that contains ILSVRC2012_val_00000001~50000.JPEG

tar_root_dir: '/mnt/DATA2/TEST/parsed_data_toy'  # dir to store all the preprocessed files
tar_train_dir: '/mnt/DATA2/TEST/parsed_data_toy/train_hkl'  # dir to store training batches
tar_val_dir: '/mnt/DATA2/TEST/parsed_data_toy/val_hkl'  # dir to store validation batches
misc_dir: '/mnt/DATA2/TEST/parsed_data_toy/misc'
# dir to store img_mean.npy, shuffled_train_filenames.npy, train.txt, val.txt

meta_clsloc_mat: '/mnt/DATA2/imageNet-2012-images/ILSVRC2014_devkit/data/meta_clsloc.mat'
val_label_file: '/mnt/DATA2/imageNet-2012-images/ILSVRC2014_devkit/data/ILSVRC2014_clsloc_validation_ground_truth.txt'
# although from ILSVRC2014, these 2 files still work for ILSVRC2012

# caffe style train and validation labels
valtxt_filename: '/mnt/DATA2/TEST/parsed_data_toy/misc/val.txt'
traintxt_filename: '/mnt/DATA2/TEST/parsed_data_toy/misc/train.txt'

A toy data set can be created using the provided generate_toy_data.sh script.

cd theano/democase/alexnet_grp1/preprocessing
chmod u+x make_hkl.py make_labels.py make_train_val_txt.py
./generate_toy_data.sh

AlexNet training on CPU

  • Modify the config.yaml file to update the path to the preprocessed dataset:
cd theano/democase/alexnet_grp1/

# Sample changes to the path for input(label_folder, mean_file) and output(weights_dir)
label_folder: /mnt/DATA2/TEST/parsed_data_toy/labels/
mean_file: /mnt/DATA2/TEST/parsed_data_toy/misc/img_mean.npy
weights_dir: ./weight/  # directory for saving weights and results
  • Similarly, modify the spec.yaml file to update the path to the parsed toy data set:
# Directories
train_folder: /mnt/DATA2/TEST/parsed_data_toy/train_hkl_b256_b256_bchw/
val_folder: /mnt/DATA2/TEST/parsed_data_toy/val_hkl_b256_b256_bchw/
  • Start the training:
./run.sh

Large Movie Review Dataset (IMDB)

The Large Movie Review Dataset is an example of a recurrent neural network using a Long Short-Term Memory (LSTM) model. The IMDB dataset is used for sentiment analysis on movie reviews with the LSTM model.

Procuring the dataset:

Obtain the imdb.pkl file from http://www-labs.iro.umontreal.ca/~lisa/deep/data/ and extract it to a local folder.

Preprocessing

The http://deeplearning.net/tutorial/lstm.html page provides two scripts:

imdb.py – Handles the loading and preprocessing of the IMDB dataset.

lstm.py – The primary script that defines and trains the model.

Copy both of the above files into the same folder as the imdb.pkl file.

Training

Training can be started using the following command:

THEANO_FLAGS="floatX=float32" python lstm.py
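On multi-core systems, training throughput is sensitive to the OpenMP thread count and thread placement. The variant below sets these explicitly via the standard OMP_NUM_THREADS and KMP_AFFINITY environment variables; the values shown are illustrative and should be tuned for your core count.

export OMP_NUM_THREADS=22
export KMP_AFFINITY=granularity=fine,compact,1,0
THEANO_FLAGS="floatX=float32" python lstm.py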

Troubleshooting

Error 1: In some cases, you might see errors stating that libmkl_rt.so or libimf.so cannot be opened. In this case, locate the libraries:

find /opt/intel -name library_name.so

Add the resulting paths to the /etc/ld.so.conf file and run the ldconfig command to link the libraries. Also make sure the Intel MKL installation paths are set correctly in the LD_LIBRARY_PATH environment variable.
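Alternatively, if Intel Parallel Studio XE was installed to its default location, sourcing the bundled environment scripts sets these paths for the current shell; the paths below assume a default /opt/intel installation.

source /opt/intel/bin/compilervars.sh intel64
source /opt/intel/mkl/bin/mklvars.sh intel64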

Error 2: AlexNet preprocessing error for toy data

python make_hkl.py toy
generating toy dataset ...
Traceback (most recent call last):
  File "make_hkl.py", line 293, in <module>
    train_batchs_per_core)
ValueError: xrange() arg 3 must not be zero

The default number of processes used to preprocess ImageNet is currently set to 16. For the toy dataset, this creates more processes than required, causing the application to crash. To resolve this issue, change the number of processes in file Alexnet_CPU/preprocessing/make_hkl.py:258 from 16 to 2. When preprocessing the full data set, however, it is recommended to use a higher value of num_process for faster preprocessing.

num_process = 2

Error 3: Referencing the correct version of NumPy when installing the Intel® Distribution for Python* through Conda

If installing the Intel Distribution for Python from within Conda instead of through the standalone installer, make sure to set the PYTHONNOUSERSITE environment variable to True. This enables the Conda environment to reference the correct version of NumPy. This is a known issue in Conda; more information can be found here.

export PYTHONNOUSERSITE=True

Resources

Appendix A

Installing Python* Tools for Other Python Distributions

CentOS:
Python 2.7 - sudo yum install python-devel python-setuptools
Python 3.5 - sudo yum install python35-libs python35-devel python35-setuptools
//Note - Python 3.5 packages can be obtained from Fedora EPEL source repository
Ubuntu:
Python 2.7 - sudo apt-get install python-dev python-setuptools
Python 3.5 - sudo apt-get install libpython3-dev python3-dev python3-setuptools
  • In case pip and Cython are not installed on the system, they can be installed using the following commands:
sudo -E easy_install pip
sudo -E pip install cython

 

Installing NumPy

NumPy is the fundamental package needed for scientific computing with Python. This package contains:

  1. A powerful N-dimensional array object
  2. Sophisticated (broadcasting) functions
  3. Tools for integrating C/C++ and Fortran code
  4. Useful linear algebra, Fourier transform, and random number capabilities.

Note: An older version of the NumPy library can be removed by verifying its existence and deleting the related files. However, in this tutorial all the remaining libraries will be installed in user’s local directory, so this step is optional. If required, old versions can be cleaned as follows:

  • Verify if old version exists:
python -c "import numpy; print numpy.version"<module 'numpy.version' from '/home/plse/.local/lib/python2.7/site-packages/numpy-1.11.0rc1-py2.7-linux-x86_64.egg/numpy/version.pyc'>
  • Delete any previously installed NumPy packages:
rm -r /home/plse/.local/lib/python2.7/site-packages/numpy-1.11.0rc1-py2.7-linux-x86_64.egg
  • Building and installing NumPy optimized for Intel architecture:
git clone https://github.com/pcs-theano/numpy.git
//update site.cfg file to point to required MKL directory. This step is optional if parallel studio or MKL were installed in default /opt/intel directory.
python setup.py config --compiler=intelem build_clib --compiler=intelem build_ext --compiler=intelem install --user

 

Installing SciPy

SciPy is an open source Python library used for scientific and technical computing. It contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers, and other tasks common in science and engineering.

  • Building and installing SciPy:
# scipy-0.16.1.tar.gz can be downloaded from https://sourceforge.net/projects/scipy/files/scipy/0.16.1/
# or obtain the latest sources from https://github.com/scipy/scipy/releases
tar -xvzf scipy-0.16.1.tar.gz
cd scipy-0.16.1/
python setup.py config --compiler=intelem --fcompiler=intelem build_clib --compiler=intelem --fcompiler=intelem build_ext --compiler=intelem --fcompiler=intelem install --user

Appendix B

Building and installing benchmark dependencies from source

DBN-Kyoto

//Untar and install all the provided tools:

cd theano/democase/DBN-Kyoto/tools
tar -xvzf Imaging-1.1.7.tar.gz
cd Imaging-1.1.7
python setup.py build
python setup.py install --user

cd theano/democase/DBN-Kyoto/tools
tar -xvzf python-dateutil-2.4.1.tar.gz
cd python-dateutil-2.4.1
python setup.py build
python setup.py install --user

cd theano/democase/DBN-Kyoto/tools
tar -xvzf pytz-2014.10.tar.gz
cd pytz-2014.10
python setup.py build
python setup.py install --user

cd theano/democase/DBN-Kyoto/tools
tar -xvzf pandas-0.15.2.tar.gz
cd pandas-0.15.2
python setup.py build
python setup.py install --user

 

AlexNet

  • Installing dependencies for AlexNet from source

Access to some of the add-on packages from the Fedora EPEL source repository may be required for running AlexNet on the CPU.

wget http://dl.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-8.noarch.rpm
sudo rpm -ihv epel-release-7-8.noarch.rpm
sudo yum install hdf5-devel
sudo yum install zmq-devel
sudo yum install zeromq-devel
sudo yum install python-zmq
  • Installing Hickle (HDF5-based clone of Pickle):
git clone https://github.com/telegraphic/hickle.git
python setup.py build install --user
  • Installing h5py (Python interface to HDF5 binary data format):
git clone https://github.com/h5py/h5py.git
python setup.py build install --user

 

References

 

About The Authors

Sunny Gogar
Software Engineer

Sunny Gogar received a Master’s degree in Electrical and Computer Engineering from the University of Florida, Gainesville and a Bachelor’s degree in Electronics and Telecommunications from the University of Mumbai, India.  He is currently a software engineer with Intel Corporation's Software and Services Group. His interests include parallel programming and optimization for Multi-core and Many-core Processor Architectures.

 Meghana Rao received a Master’s degree in Engineering and Technology Management from Portland State University and a Bachelor’s degree in Computer Science and Engineering from Bangalore University, India.  She is a Developer Evangelist with the Software and Services Group at Intel focused on Machine Learning and Deep Learning.

 

How to Emulate Persistent Memory on an Intel® Architecture Server


Introduction

This tutorial provides a method for setting up persistent memory (PMEM) emulation using regular dynamic random access memory (DRAM) on an Intel® processor using a Linux* kernel version 4.3 or higher. The article covers the hardware configuration and walks you through setting up the software. After following the steps in this article, you'll be ready to try the PMEM programming examples in the NVM Library at pmem.io.

Why do this?

If you’re a software developer who wants to get started early developing software or preparing your applications to have PMEM awareness, you can use this emulation for development before PMEM hardware is widely available.

What is persistent memory?

Traditional applications organize their data between two tiers: memory and storage. Emerging PMEM technologies introduce a third tier. This tier can be accessed like volatile memory, using processor load and store instructions, but it retains its contents across power loss like storage. Because the emulation uses DRAM, data will not be retained across power cycles.

Hardware and System Requirements

Emulation of persistent memory is based on DRAM memory that will be seen by the operating system (OS) as a Persistent Memory region. Because it is a DRAM-based emulation it is very fast, but will lose all data upon powering down the machine. The following hardware was used for this tutorial:

CPU and Chipset

Intel® Xeon® processor E5-2699 v4, 2.2 GHz

  • # of cores per chip: 22 (only a single core was used)
  • # of sockets: 2
  • Chipset: Intel® C610 chipset, QS (B-1 step)
  • System bus: 9.6 GT/s Intel® QuickPath Interconnect

Platform

Platform: Intel® Server System R2000WT product family (code-named Wildcat Pass)

  • BIOS: GRRFSDP1.86B.0271.R00.1510301446 ME:V03.01.03.0018.0 BMC:1.33.8932
  • DIMM slots: 24
  • Power supply: 1x1100W

Memory

Memory size: 256 GB (16 x 16 GB) DDR4 2133P

Brand/model: Micron* – MTA36ASF2G72PZ2GATESIG

Storage

Brand and model: 1 TB Western Digital* (WD1002FAEX)

Operating system

CentOS* 7.2 with kernel 4.5.3

Table 1 - System configuration used for the PMEM emulation.

Linux* Kernel

Linux Kernel 4.5.3 was used during development of this tutorial. Support for persistent memory devices and emulation has been present in the kernel since version 4.0; however, a kernel newer than 4.2 is recommended for easier configuration. The emulation should work with any Linux distribution able to handle an official kernel. To configure the proper drivers, run make nconfig and enable them as shown below. Figures 1 to 5 show the correct settings for NVDIMM Support in the Kernel Configuration menu.

$ make nconfig

        -> Device Drivers -> NVDIMM Support -> <M> PMEM; <M> BLK; <*> BTT

Figure 1: Set up device drivers.

Figure 2: Set up the NVDIMM device.

Figure 3: Set up the file system for Direct Access support.

Figure 4: Set up for Direct Access (DAX) support.

Figure 5: NVDIMM Support properties.

The kernel will offer these regions to the PMEM driver so they can be used for persistent storage. Figures 6 and 7 show the correct setting for the processor type and features in the Kernel Configuration menu.

$ make nconfig

        -> Processor type and features -> <*> Support non-standard NVDIMMs and ADR protected memory


Figure 6: Set up the processor to support NVDIMMs.

Figure 7: Enable non-standard NVDIMMs and ADR protected memory.

Now you are ready to build your kernel using the instructions below.

$ make -jX

        Where X is the number of cores on the machine
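
For example, on the two-socket system described above (2 x 22 cores), you might run:

$ make -j44

or, more portably, $ make -j$(nproc), where nproc reports the number of logical CPUs (which may include hyper-threads).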

When building the new kernel, there is a clear performance benefit to compiling in parallel. In our experiments, compiling with multiple threads was up to 95 percent faster than compiling with a single thread, which makes the whole kernel setup go much faster. Figures 8 and 9 show the CPU utilization and the performance gain chart for compiling with different numbers of threads.

Figure 8: Compiling the kernel sources.

Figure 9: Performance gain for compiling the source in parallel.

Install the Kernel

# make modules_install install

Figure 10: Installing the kernel.

Reserve a memory region by modifying the kernel command line so that it appears to the OS as a persistent memory location. The region of memory to be used is from ss to ss+nn. [KMG] refers to kilo, mega, or giga (bytes).

memmap=nn[KMG]!ss[KMG]

For example, memmap=4G!12G reserves 4 GB of memory between the 12th and 16th GB. Configuration is done within GRUB and varies between Linux distributions. An example GRUB configuration follows.

Under CentOS 7.0

# vi /etc/default/grub
GRUB_CMDLINE_LINUX="memmap=nn[KMG]!ss[KMG]"
On BIOS-based machines:
# grub2-mkconfig -o /boot/grub2/grub.cfg
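
As a concrete example, reserving 4 GB at the 12 GB boundary (as above) would use GRUB_CMDLINE_LINUX="memmap=4G!12G". On UEFI-based machines the generated configuration typically lives under /boot/efi instead; the path shown below is an assumption for CentOS and may differ on other distributions:

# grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg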

Figure 11 shows the added PMEM statement in the GRUB file. Figure 12 shows the instructions to make the GRUB configuration.

Figure 11: Define PMEM regions in the /etc/default/grub file.

Figure 12: Generate the boot configuration file based on the grub template.

After the machine reboots, you should be able to see the emulated device as /dev/pmem0…pmem3. Requesting reserved memory regions for persistent memory emulation results in split memory ranges defining persistent (type 12) regions, as shown in Figure 13. A general recommendation is either to use memory from the 4 GB+ range (memmap=nnG!4G) or to check the e820 memory map up front and fit the reservation within a usable range. If you don't see the device, verify the memmap setting in the grub file (as shown in Figure 11), then analyze the dmesg(1) output as shown in Figure 13. You should be able to see the reserved ranges in the dmesg output.

Figure 13: Persistent memory regions are highlighted as (type 12).

You'll see that there can be multiple non-overlapping regions reserved as persistent memory. Passing multiple memmap="...!..." entries results in multiple devices exposed by the kernel, visible as /dev/pmem0, /dev/pmem1, /dev/pmem2, and so on.
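
For example, a hypothetical kernel command line containing

memmap=4G!12G memmap=4G!20G

reserves two separate 4 GB regions, which appear as /dev/pmem0 and /dev/pmem1.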

DAX - Direct Access

The DAX (direct access) extensions to the filesystem create a PMEM-aware environment. Some distros, such as Fedora* 24 and later, already have DAX/PMEM built in as a default, and have NVML available as well. One quick way to check whether the kernel has DAX and PMEM built in is to grep the kernel's config file, which is usually provided by the distro under /boot. Use the command below:

# egrep '(DAX|PMEM)' /boot/config-`uname -r`

The result should be something like:

CONFIG_X86_PMEM_LEGACY_DEVICE=y
CONFIG_X86_PMEM_LEGACY=y
CONFIG_BLK_DEV_RAM_DAX=y
CONFIG_BLK_DEV_PMEM=m
CONFIG_FS_DAX=y
CONFIG_FS_DAX_PMD=y
CONFIG_ARCH_HAS_PMEM_API=y

To create and mount a filesystem with DAX (available today for ext4 and XFS):

# mkdir /mnt/pmemdir
# mkfs.ext4 /dev/pmem3
# mount -o dax /dev/pmem3 /mnt/pmemdir
Now files can be created on the freshly mounted partition, and given as an input to NVML pools.

Figure 14: Persistent memory blocks.

Figure 15: Making a file system.
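
Once the DAX mount is in place, a short C program can exercise the emulated persistent memory through the NVM Library's libpmem. The sketch below assumes a recent NVML/libpmem is installed and uses a hypothetical file name /mnt/pmemdir/hello; build it with something like cc hello_pmem.c -o hello_pmem -lpmem:

#include <stdio.h>
#include <string.h>
#include <libpmem.h>

int main(void)
{
    size_t mapped_len;
    int is_pmem;

    /* Create (if needed) and map a 4 KiB file on the DAX-mounted filesystem. */
    char *addr = pmem_map_file("/mnt/pmemdir/hello", 4096, PMEM_FILE_CREATE,
                               0666, &mapped_len, &is_pmem);
    if (addr == NULL) {
        perror("pmem_map_file");
        return 1;
    }

    strcpy(addr, "hello, persistent memory");

    /* Flush the store from the CPU caches to the (emulated) media. */
    if (is_pmem)
        pmem_persist(addr, mapped_len);
    else
        pmem_msync(addr, mapped_len);

    pmem_unmap(addr, mapped_len);
    return 0;
}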

It is additionally worth mentioning that you can emulate persistent memory with a ramdisk (e.g., /dev/shm), or force PMEM-like behavior by setting the environment variable PMEM_IS_PMEM_FORCE=1. This eliminates the performance hit caused by msync(2).

Conclusion

By now, you know how to set up an environment where you can build a PMEM application without actual PMEM hardware. With the additional cores on an Intel® architecture server, you can quickly build a new kernel with PMEM support for your emulation environment.

Author(s)

Thai Le is the software engineer focusing on cloud computing and performance computing analysis at Intel Corporation.

Introduction to Heterogeneous Streams Library


Introduction

To efficiently utilize all available resources for a task-concurrency application on heterogeneous platforms, designers need to understand the memory architecture, the thread utilization on each platform, and the pipeline for offloading work to the different platforms, and they must coordinate all of these activities.

To relieve designers of the burden of implementing the necessary infrastructure, the Heterogeneous Streaming (hStreams) library provides a set of well-defined APIs to support a task-based parallelism model on heterogeneous platforms. hStreams uses the Intel® Coprocessor Offload Infrastructure (Intel® COI) to implement this infrastructure. That is, the host decomposes the workload into tasks, one or more tasks are executed on separate targets, and finally the host gathers the results from all of the targets. Note that the host can be a target too.

Intel® Manycore Platform Software Stack (Intel® MPSS) version 3.6 contains the hStreams library, documentation, and sample code. Starting with Intel MPSS 3.7, hStreams was removed from the Intel MPSS package and became an open source project. The current version 1.0 supports the Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor as targets. hStreams binaries version 1.0.0 can be downloaded from the project site.

Users can contribute to hStreams development at https://github.com/01org/hetero-streams. The following tables summarize the tools that support hStreams in Linux and Windows:

Name of Tool (Linux*)                        Supported Version
Intel® Manycore Platform Software Stack      3.4, 3.5, 3.6, 3.7
Intel® C++ Compiler                          15.0, 16.0
Intel® Math Kernel Library                   11.2, 11.3

 

Name of Tool (Windows*)                      Supported Version
Intel MPSS                                   3.4, 3.5, 3.6, 3.7
Intel C++ Compiler                           15.0, 16.0
Intel Math Kernel Library                    11.2, 11.3
Visual Studio*                               11.0 (2012)

This whitepaper briefly introduces hStreams and highlights its concepts. For a full description, readers are encouraged to read the tutorial included in the hStreams package mentioned above.

Execution model concepts

This section highlights some basic concepts of hStreams: source and sink, domains, streams, buffers, and actions:

  • Streams are FIFO queues where actions are enqueued. Streams are associated with logical domains. Each stream has two endpoints, source and sink, each of which is bound to a logical domain.
  • Source is where the work is enqueued; sink is where the work is executed. In the current implementation, the source process runs on an Intel Xeon processor-based machine, and the sink process runs on a machine that can be the host itself, an Intel Xeon Phi coprocessor, or, in the future, any other hardware platform. The library allows the source machine to invoke a user-defined function on the target machine.
  • Domains represent the resources of heterogeneous platforms. A physical domain is the set of all resources available in a platform (memory and computing). For example, an Intel Xeon processor-based machine and an Intel Xeon Phi coprocessor are two different physical domains. A logical domain is a subset of a given physical domain; it uses any subset of available cores in a physical domain. The only restriction is that two logical domains cannot be partially overlapping.
  • Buffers represent memory resources to transfer data between source and sink. In order to transfer data, the user must create a buffer by calling an appropriate API, and a corresponding physical buffer is instantiated at the sink. Buffers can have properties such as memory type (for example, DDR or HBW) and affinity (for example, sub-NUMA clustering).
  • Actions are requests to execute functions at the sinks (compute actions), to transfer data from source to sink or vice versa (memory movement actions), and to synchronize tasks among streams (synchronization actions). Actions enqueued in a stream are processed with first in, first out (FIFO) semantics: the source places an action in the stream and the sink removes it. All actions are non-blocking (asynchronous) and have completion events. Remote invocations can be user-defined functions or optimized convenience functions (for example, dgemm). Thus, the FIFO stream queue handles dependencies within a stream, while synchronization actions handle dependencies among streams.

In a typical scenario, the source-side code allocates stream resources, allocates memory, transfers data to the sink, invokes the sink to execute a predefined function, handles synchronization, and eventually terminates streams. Note that actions such as data transferring, remote invocation, and synchronization are handled in FIFO streams. The sink-side code simply executes the function that the source requested.

For example, consider the pseudo-code of a simple hStreams application that creates two streams; the source transfers data to the sinks, performs remote invocations at the sinks, and then transfers the results back to the source host:

Step 1: Initialize two streams 0 and 1

Step 2: Allocate buffers A0, B0, C0, A1, B1, C1

Step 3: Use stream i, transfer memory Ai, Bi to sink (i=0,1)

Step 4: Invoke remote computing in stream i: Ai + Bi -> Ci (i=0,1)

Step 5: Transfer memory Ci back to host (i=0,1)

Step 6: Synchronize

Step 7: Terminate streams

The following figure illustrates the actions generated at the host:

Actions are placed in the corresponding streams and removed at the sinks:

hStreams provides two levels of APIs: the app API and the core API. The app API offers simple interfaces; it is targeted at novice users who want to ramp up quickly on the hStreams library. The core API gives advanced users the full functionality of the library. The app APIs in fact call the core-layer APIs, which in turn use Intel COI and the Symmetric Communication Interface (SCIF). Note that users can mix these two levels of API when writing their applications. For more details on the hStreams API, refer to the document Programming Guide and API Reference. The following figure illustrates the relation between the hStreams app API and the core API.

Refer to the document "Hetero Streams Library 1.0 Programming Guide and API" and the tutorial included in the hStreams download package for more information.

Building and running a sample hStreams program

This section illustrates a sample code that makes use of the hStreams app API. It also demonstrates how to build and run the application. The sample code is an MPI program running on an Intel Xeon processor host with two Intel Xeon Phi coprocessors connected.

First, download the package from https://github.com/01org/hetero-streams. Then, follow the instructions to build and install the hStreams library on an Intel Xeon processor-based host machine, which in this case runs Intel MPSS 3.7.2. This host machine has two Intel Xeon Phi coprocessors installed and connects to a remote Intel Xeon processor-based machine. This remote machine (10.23.3.32) also has two Intel Xeon Phi coprocessors.

This sample code creates two streams; each stream runs explicitly on a separate coprocessor. An MPI rank manages these two streams.

The application consists of two parts: The source-side code is shown in Appendix A and the corresponding sink-side code is shown in Appendix B. The sink-side code contains a user-defined function vector_add, which is to be invoked by the source.

This sample MPI program is designed to run with two MPI ranks, each on a different domain (Intel Xeon processor host). Each rank initializes two streams, and each stream is responsible for communicating with one coprocessor. Each MPI rank enqueues the required actions into the streams in the following order: a memory transfer action from source to sink, a remote invocation action, and a memory transfer action from sink to source. The following app APIs are called in the source-side code; a condensed sketch follows the list:

  • hStreams_app_init: Initialize and create streams across all available Intel Xeon Phi coprocessors. This API assumes one logical domain per physical domain.
  • hStreams_app_create_buf: Create an instantiation of buffers in all currently existing logical domains.
  • hStreams_app_xfer_memory: Enqueue memory transfer action in a stream; depending on the specified direction, memory is transferred from source to sink or sink to source.
  • hStreams_app_invoke: Enqueue a user-defined function in a stream. This function is executed at the stream sink. Note that the user also needs to implement the remote target function in the sink-side program.
  • hStreams_app_event_wait: This sync action blocks until the set of specified events is completed. In this example, only the last transaction in a stream is required, since all other actions should be completed.
  • hStreams_app_fini: Destroy hStreams internal structures and clear the library state.
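
Putting these calls together, the source side of the seven pseudo-code steps shown earlier reduces, for a single stream, to a sketch like the one below. The argument lists are condensed from the sample code and should be treated as illustrative rather than authoritative; consult the Programming Guide for the exact signatures. vector_add is the user-defined sink-side function from this sample.

// Source-side sketch for one stream; argument lists are illustrative.
#include <hStreams_app_api.h>
#include <stdint.h>

int main()
{
    static double A[1024], B[1024], C[1024];

    hStreams_app_init(1, 1);                         // Step 1: create streams

    hStreams_app_create_buf(A, sizeof(A));           // Step 2: instantiate buffers
    hStreams_app_create_buf(B, sizeof(B));
    hStreams_app_create_buf(C, sizeof(C));

    hStreams_app_xfer_memory(0, A, A, sizeof(A),     // Step 3: source -> sink
                             HSTR_SRC_TO_SINK, NULL);
    hStreams_app_xfer_memory(0, B, B, sizeof(B),
                             HSTR_SRC_TO_SINK, NULL);

    uint64_t args[3] = { (uint64_t)A, (uint64_t)B, (uint64_t)C };
    hStreams_app_invoke(0, "vector_add",             // Step 4: remote invocation
                        0, 3, args, NULL, NULL, 0);

    HSTR_EVENT done;
    hStreams_app_xfer_memory(0, C, C, sizeof(C),     // Step 5: sink -> source
                             HSTR_SINK_TO_SRC, &done);
    hStreams_app_event_wait(1, &done);               // Step 6: synchronize

    hStreams_app_fini();                             // Step 7: terminate streams
    return 0;
}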

Intel MPSS 3.7.2 and Intel® Parallel Studio XE 2016 update 3 are installed on the Intel® Xeon® processor E5-2600 based host machine. First, bring the Intel MPSS service up and set up the compiler environment variables on the host machine:

$ sudo service mpss start

$ source /opt/intel/composerxe/bin/compilervars.sh intel64

To compile the source-side code, link the source-side code with the dynamic library hstreams_source which provides source functionality:

$ mpiicpc hstream_sample_src.cpp -O3 -o hstream_sample -lhstreams_source \
      -I/usr/include/hStreams -qopenmp

The above command generates the executable hstream_sample. To generate the user kernel library for the coprocessor (as sink), compile with the flag -mmic:

$ mpiicpc -mmic -fPIC -O3 hstream_sample_sink.cpp -o ./mic/hstream_sample_mic.so \
      -I/usr/include/hStreams -qopenmp -shared

By convention, the target library takes the form <exec_name>_mic.so for the Intel Xeon Phi coprocessor and <exec_name>_host.so for the host. The command above generates the library hstream_sample_mic.so under the mic/ folder.

To run this application, set the environment variable SINK_LD_LIBRARY_PATH so that the hStreams runtime can find the user kernel library hstream_sample_mic.so:

$ export SINK_LD_LIBRARY_PATH=/opt/mpss/3.7.2/sysroots/k1om-mpss-linux/usr/lib64:~/work/hStreams/collateral/delivery/mic:$MIC_LD_LIBRARY_PATH

Run this program with two ranks, one rank running on this current host and one rank running on the host whose IP address is 10.23.3.32, as follows:

$ mpiexec.hydra -n 1 -host localhost ~/work/hstream_sample : -n 1 -wdir ~/work -host 10.23.3.32 ~/work/hstream_sample

Hello world! rank 0 of 2 runs on knightscorner5
Hello world! rank 1 of 2 runs on knightscorner0.jf.intel.com
Rank 0: stream 0 moves A
Rank 0: stream 0 moves B
Rank 0: stream 1 moves A
Rank 0: stream 1 moves B
Rank 0: compute on stream 0
Rank 0: compute on stream 1
Rank 0: stream 0 Xtransfer data in C back
knightscorner5-mic0
knightscorner5-mic1
Rank 1: stream 0 moves A
Rank 1: stream 0 moves B
Rank 1: stream 1 moves A
Rank 1: stream 1 moves B
Rank 1: compute on stream 0
Rank 1: compute on stream 1
Rank 1: stream 0 Xtransfer data in C back
knightscorner0-mic0.jf.intel.com
knightscorner0-mic1.jf.intel.com
Rank 0: stream 1 Xtransfer data in C back
Rank 1: stream 1 Xtransfer data in C back
sink: compute on sink in stream num: 0
sink: compute on sink in stream num: 0
sink: compute on sink in stream num: 1
sink: compute on sink in stream num: 1
C0=97.20 C1=90.20 C0=36.20 C1=157.20 PASSED!

Conclusion

hStreams provides a well-defined set of APIs that allows users to quickly design task-based applications for heterogeneous platforms. Two levels of hStreams API co-exist: the app API offers simple interfaces for novice users to ramp up quickly on the hStreams library, and the core API gives advanced users the full functionality of the library. This paper presented some basic hStreams concepts and illustrated how to build and run an MPI program that takes advantage of the hStreams interface.

 

About the Author

Loc Q Nguyen received an MBA from University of Dallas, a master’s degree in Electrical Engineering from McGill University, and a bachelor's degree in Electrical Engineering from École Polytechnique de Montréal. He is currently a software engineer with Intel Corporation's Software and Services Group. His areas of interest include computer networking, parallel computing, and computer graphics.

libstdc++ Source Files


Please find the libstdc++ sources used by PSET for the Linux* product here.


Celebrating 10 Years of Intel® Threading Building Blocks


 Intel TBBWhat a Journey It's Been.

Intel® Threading Building Blocks (Intel® TBB) has come a long way from where it started in 2006 to its 10-year anniversary in 2016. But on this long and winding journey, we've never lost sight of our core values of innovation and customer satisfaction.

Intel TBB is a powerful tool that lets developers leverage multi-core performance and heterogeneous computing without having to be threading or parallel programming experts. It is:

  • A tool to parallelize computationally intensive work, delivering higher-level and simpler solutions using standard C++.
  • The most feature-rich and comprehensive solution for parallel application development.
  • Highly portable, composable, affordable, and approachable, providing future-proof scalability.
  • Compiler agnostic, supporting multiple operating systems, and optimized for all Intel® architectures.

If you've been with us on this journey, we thank you for your help and support in making Intel TBB the best tool it can be.

If you haven't yet seen what Intel TBB can do for you, now's the time.

We're looking forward to the road ahead.

 

How University of Bristol Accelerated Rational Drug Design


Task-based parallel programming is the future. The University of Bristol Advanced Computing Research Centre wants to be part of that future. It provides advanced computing support to researchers, with a team of research software engineers who work with academics across a range of disciplines to help optimize research software that can be applied in industry.

With help from Intel® Threading Building Blocks (Intel® TBB), the University is able to provide a simple abstraction that will enable research software to adapt to the massively multicore future. To perform some of the calculations needed for drug design, the University uses the LigandSwap* program with a task-based parallel programming approach, aided by Intel TBB and its efficient task scheduling. The researchers found that parallelizing LigandSwap with Intel TBB took less than 100 lines of Intel TBB-specific code in a code base of more than 100,000 lines, and enabled a calculation that would ordinarily take 25 days to complete in just one day.

Learn all about it in the new University of Bristol case study.
 

Advanced Bitrate Control Methods in Intel® Media SDK


Introduction

In the world of media, there is a great demand to increase encoder quality, but this comes with tradeoffs between quality and bandwidth consumption. This article addresses some of those concerns by discussing advanced bitrate control methods, which provide the ability to increase quality (relative to legacy rate controls) while keeping the bitrate constant, using the Intel® Media SDK / Intel® Media Server Studio tools.

The Intel Media SDK encoder offers many bitrate control methods, which can be divided into legacy and advanced/special purpose algorithms. This article is the second part of a two-part series on bitrate control methods in the Intel® Media SDK. The legacy rate control algorithms are detailed in the first part, Bitrate Control Methods (BRC) in Intel® Media SDK; the advanced rate control methods (summarized in the table below) are explained in this article.

Rate Control    HRD/VBV Compliant    OS Supported     Usage
LA              No                   Windows/Linux    Storage transcodes
LA_HRD          Yes                  Windows/Linux    Storage transcodes; streaming solutions (where low latency is not a requirement)
ICQ             No                   Windows          Storage transcodes (better quality with smaller file size)
LA_ICQ          No                   Windows          Storage transcodes

The following tools were used to explain the concepts and generate the performance data for this article: the Intel Media SDK code samples (sample_encode and sample_multi_transcode), Intel® Video Pro Analyzer, and the Video Quality Caliper.

Look Ahead (LA) Rate Control

As the name explains, this bitrate control method looks at successive frames, or the frames to be encoded next, and stores them in a look-ahead buffer. The number of frames or the length of the look ahead buffer can be specified by the LookAheadDepth parameter. This rate control is recommended for transcoding/encoding in a storage solution.

Generally, many parameters can be used to adjust the quality/performance of the encoded stream. In this particular rate control, encoding performance can be tuned by changing the size of the look ahead buffer. The LookAheadDepth parameter can be set between 10 and 100 to specify the size of the look ahead buffer; it specifies the number of frames that the SDK encoder analyzes before encoding. As LookAheadDepth increases, so does the number of frames that the encoder looks into; this increases the quality of the encoded stream, while the performance (encoding frames per second) decreases. In our experiments, this performance tradeoff was negligible for small input streams such as the Sintel 1080p sequence.

Look Ahead rate control is enabled by default in sample_encode and sample_multi_transcode, which are part of the Intel Media SDK code samples. The example below shows how to use this rate control method with the sample_encode application.

sample_encode.exe h264 -i sintel_1080p.yuv -o LA_out.264 -w 1920 -h 1080 -b 10000 -f 30 -lad 100 -la

As the value of LookAheadDepth increases, encoding quality improves, because the number of frames stored in the look ahead buffer has also increased, and the encoder will have more visibility to upcoming frames.

It should be noted that LA is not HRD (Hypothetical Reference Decoder) compliant. The following picture, obtained from Intel® Video Pro Analyzer, shows an HRD buffer fullness view with "Buffer" mode enabled, where the "HRD" sub-mode is greyed out. This means no HRD parameters were passed in the stream headers, which indicates that LA rate control is not HRD compliant. The left axis of the plot shows frame sizes and the right axis shows the slice QP (Quantization Parameter) values.

Figure 1: Snapshot of Intel Video Pro Analyzer analyzing an H264 stream (Sintel, 1080p) encoded using the LA rate control method.

 

Sliding Window Condition

The sliding window algorithm is part of the Look Ahead rate control method. It is applicable to both the LA and LA_HRD rate control methods by defining WinBRCMaxAvgKbps and WinBRCSize through the mfxExtCodingOption3 structure.

The sliding window condition is introduced to strictly constrain the maximum bitrate of the encoder through two parameters: WinBRCSize and WinBRCMaxAvgKbps. This helps limit the achieved bitrate, which makes it a good fit for limited-bandwidth scenarios such as live streaming.

  • WinBRCSize parameter specifies the sliding window size in frames. A setting of zero means that sliding window condition is disabled.
  • WinBRCMaxAvgKbps specifies the maximum bitrate averaged over a sliding window specified by WinBRCSize.

In this technique, the average bitrate in a sliding window of WinBRCSize frames must not exceed WinBRCMaxAvgKbps. The condition becomes weaker as the sliding window size increases and stronger as it decreases. Whenever the condition fails, the frame is automatically re-encoded with a higher quantization parameter, and encoder performance decreases with each failure. To reduce the number of failures and avoid re-encoding, the frames within the look ahead buffer are analyzed by the encoder. A peak is detected when a large frame in the look ahead buffer would cause a condition failure; whenever a peak is predicted, the quantization parameter value is increased, reducing the frame size.

Sliding window can be implemented by adding the following code to the pipeline_encode.cpp program in the sample_encode application.

m_CodingOption3.WinBRCMaxAvgKbps = 1.5*TargetKbps;
m_CodingOption3.WinBRCSize = 90; //3*framerate
m_EncExtParams.push_back((mfxExtBuffer *)&m_CodingOption3);

The above values were chosen when encoding sintel_1080p.yuv of 1253 frames with H.264 codec, TargetKbps = 10000, framerate = 30fps. Sliding window parameter values (WinBRCMaxAvgKbps and WinBRCSize) are subject to change when using different input options.

If WinBRCMaxAvgKbps is close to TargetKbps and WinBRCSize almost equals 1, the sliding window degenerates into a simple limit on the maximum frame size (TargetKbps/framerate).

The sliding window condition can be evaluated by checking that, in any WinBRCSize consecutive frames, the total encoded size does not exceed the value set by WinBRCMaxAvgKbps. The following equation expresses the sliding window condition.
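
With frame sizes measured in bits and the frame rate in frames per second, the condition just stated can be written as:

\[
\sum_{i=k}^{k+\mathrm{WinBRCSize}-1} \mathrm{FrameSize}_i \;\le\; \frac{\mathrm{WinBRCMaxAvgKbps}\times 1000}{\mathrm{framerate}}\times \mathrm{WinBRCSize} \qquad \text{for every window start } k
\]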

The frame-size-limiting condition can be checked after the asynchronous encoder runs and the encoded data is written back to the output file in pipeline_encode.cpp.

Look Ahead with HRD Compliance (LA_HRD) Rate Control

As Look Ahead bitrate control is not HRD compliant, there is a dedicated mode to achieve HRD compliance with the LookAhead algorithm, known as LA_HRD mode (MFX_RATECONTROL_LA_HRD). With HRD compliance, the Coded Picture Buffer should neither overflow nor underflow. This rate control is recommended in storage transcoding solutions and streaming scenarios, where low latency is not a major requirement.

To use this rate control in sample_encode, code changes are required as illustrated below.

Statements to be added in sample_encode.cpp, within the ParseInputString() function:

else if (0 == msdk_strcmp(strInput[i], MSDK_STRING("-hrd")))
pParams->nRateControlMethod = MFX_RATECONTROL_LA_HRD;

The LookAheadDepth value can be specified on the command line when executing the sample_encode binary. The example below shows how to use this rate control method with the sample_encode application.

sample_encode.exe h264 -i sintel_1080p.yuv -o LA_out.264 -w 1920 -h 1080 -b 10000 -f 30 -lad 100 -hrd

In the following graph, the LookAheadDepth (lad) value is 100.

Figure 2: Snapshot of Intel® Video Pro Analyzer (VPA) verifying that the LA_HRD rate control is HRD compliant. The buffer fullness view is activated by selecting "Buffer" mode, with "HRD" chosen as the sub-mode.

The above figure shows the HRD buffer fullness view with "Buffer" mode enabled in Intel VPA, in which the sub-mode "HRD" is selected. The horizontal red lines show the upper and lower limits of the buffer, and the green line shows the instantaneous buffer fullness. The buffer fullness never crosses the upper or lower limits of the buffer, meaning neither overflow nor underflow occurred with this rate control.

Extended Look Ahead (LA_EXT) Rate Control

For 1:N transcoding scenarios (one decode and N encode sessions), there is an optimized look ahead algorithm known as the Extended Look Ahead rate control algorithm (MFX_RATECONTROL_LA_EXT), available only in Intel® Media Server Studio (not part of the Intel® Media SDK). It is recommended for broadcasting solutions.

An application must load the plugin 'mfxplugin64_h264la_hw.dll' to support MFX_RATECONTROL_LA_EXT. This plugin can be found at the following location on systems where Intel® Media Server Studio is installed:

  • “\Program Installed\Software Development Kit\bin\x64\588f1185d47b42968dea377bb5d0dcb4”.

The path of this plugin needs to be specified explicitly because it is not part of the standard installation directory. This capability can be used in either of two ways:

  1. Preferred method: register the plugin in the registry with all necessary attributes (API version, plugin type, path, and so on), so the dispatcher, which is part of the software, can find it through the registry and connect it to a decoding/encoding session.
  2. Have all binaries (Media SDK, plugin, and app) in one directory and execute from that directory.

The LookAheadDepth parameter is specified only once and applies to all N transcoded streams. LA_EXT rate control can be exercised using sample_multi_transcode; an example command line follows.

sample_multi_transcode.exe -par file_1.par

The contents of the par file are:

-lad 40 -i::h264 input.264 -join -la_ext -hw_d3d11 -async 1 -n 300 -o::sink
-h 1088 -w 1920 -o::h264 output_1.0.h264 -b 3000 -join -async 1 -hw_d3d11 -i::source -l 1 -u 1 -n 300
-h 1088 -w 1920 -o::h264 output_2.h264 -b 5000 -join -async 1 -hw_d3d11 -i::source -l 1 -u 1 -n 300
-h 1088 -w 1920 -o::h264 output_3.h264 -b 7000 -join -async 1 -hw_d3d11 -i::source -l 1 -u 1 -n 300
-h 1088 -w 1920 -o::h264 output_4.h264 -b 10000 -join -async 1 -hw_d3d11 -i::source -l 1 -u 1 -n 300

Intelligent Constant Quality (ICQ) Rate Control

The ICQ bitrate control algorithm is designed to improve the subjective video quality of an encoded stream; it may or may not improve video quality objectively, depending on the content. ICQQuality is the control parameter that defines the quality factor for this method; it can be set between 1 and 51, where 1 corresponds to the best quality. The achieved bitrate and encoder quality (PSNR) can be adjusted by increasing or decreasing ICQQuality. This rate control is recommended for storage solutions where high quality is required while maintaining a smaller file size.

To use this rate control in sample_encode, code changes are required as explained below.

Statements to be added in sample_encode.cpp, within the ParseInputString() function:

else if (0 == msdk_strcmp(strInput[i], MSDK_STRING("-icq")))
pParams->nRateControlMethod = MFX_RATECONTROL_ICQ;

ICQQuality is available in the mfxInfoMFX structure. The desired value can be set in the InitMfxEncParams() function, e.g.:

m_mfxEncParams.mfx.ICQQuality = 12;

The example below describes how to use this rate control method using the sample_encode application.

sample_encode.exe h264 -i sintel_1080p.yuv -o ICQ_out.264 -w 1920 -h 1080 -b 10000 -icq
Figure 3: Using Intel Media SDK samples and Video Quality Caliper, comparing VBR and ICQ (ICQQuality varied between 13 and 18) with H264 encoding for the 1080p, 30 fps sintel.yuv sequence of 1253 frames.

Using about the same bitrate, ICQ shows improved Peak Signal to Noise Ratio (PSNR) in the above plot. The RD-graph data for the above plot was captured using the Video Quality Caliper, which compares two different streams encoded with ICQ and VBR.

Observation from above performance data:

  • At the same achieved bitrate, ICQ shows much improved quality (PSNR) compared to VBR, while maintaining the same encoding FPS.
  • The encoding bitrate and quality of the stream decrease as the ICQQuality parameter value increases.

The snapshot below shows a subjective comparison between encoded frames using VBR (on the left) and ICQ (on the right). Highlighted sections demonstrate missing details in VBR and improvements in ICQ.

Figure 4: Using Video Quality Caliper to subjectively compare encoded frames for VBR vs. ICQ.

 

Look Ahead & Intelligent Constant Quality (LA_ICQ) Rate Control

This method combines ICQ with Look Ahead. This rate control is also recommended for storage solutions. ICQQuality and LookAheadDepth are the two control parameters: the quality factor is specified by mfxInfoMFX::ICQQuality, and the look ahead depth is controlled by the mfxExtCodingOption2::LookAheadDepth parameter.

To use this rate control in sample_encode, code changes are required as explained below.

Statements to be added in sample_encode.cpp, within the ParseInputString() function:

else if (0 == msdk_strcmp(strInput[i], MSDK_STRING("-laicq")))
pParams->nRateControlMethod = MFX_RATECONTROL_LA_ICQ;

ICQQuality is available in the mfxInfoMFX structure. The desired value can be set in the InitMfxEncParams() function:

m_mfxEncParams.mfx.ICQQuality = 12;

LookAheadDepth can be specified on the command line as -lad:

sample_encode.exe h264 -i sintel_1080p.yuv -o LAICQ_out.264 -w 1920 -h 1080 -b 10000 -laicq -lad 100
Figure 5: Using Intel Media SDK samples and Video Quality Caliper, comparing VBR and LA_ICQ (LookAheadDepth 100, ICQQuality varied between 20 and 26) with H264 encoding for the 1080p, 30 fps sintel.yuv sequence of 1253 frames.

At a similar bitrate, better PSNR is observed for LA_ICQ compared to VBR, as shown in the above plot. Keeping the LookAheadDepth value at 100, the ICQQuality parameter value was varied within its 1 to 51 range. The RD-graph data for this plot was captured using the Video Quality Caliper, which compares two different streams encoded with LA_ICQ and VBR.

Conclusion

Several advanced bitrate control methods are available to experiment with, to determine whether higher quality encoded streams can be achieved while keeping bandwidth requirements constant. Each rate control has its own advantages and fits specific industry use cases depending on the requirements. To implement the bitrate control methods, refer also to the Intel® Media SDK Reference Manual, which comes with an installation of the Intel® Media SDK or Intel® Media Server Studio, and the Intel® Media Developer's Guide from the documentation website. Visit Intel's media support forum for further questions.

Improve Vectorization Performance using Intel® Advanced Vector Extensions 512


This article shows a simple example of a loop that was not vectorized by the Intel® C++ Compiler due to possible data dependencies, but which is now vectorized using the Intel® Advanced Vector Extensions 512 instruction set on an Intel® Xeon Phi™ processor. We will explore why the compiler, using this instruction set, automatically recognizes the loop as vectorizable, and will discuss some issues concerning the vectorization performance.

Introduction

When optimizing code, the first efforts should be focused on vectorization. The most fundamental way to efficiently utilize the resources in modern processors is to write code that can run in vector mode by taking advantage of special hardware like vector registers and SIMD (Single Instruction Multiple Data) instructions. Data parallelism in the algorithm/code is exploited in this stage of the optimization process.

Making the most of fine grain parallelism through vectorization will allow the performance of software applications to scale with the number of cores in the processor by using multithreading and multitasking. Efficient use of single-core resources will be critical in the overall performance of the multithreaded application, because of the multiplicative effect of vectorization and multithreading.

The new Intel® Xeon Phi™ processor features 512-bit wide vector registers. The new Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction set architecture (ISA), which is supported by the Intel Xeon Phi processor (and future Intel® processors), offers support for vector-level parallelism, which allows the software to use two vector processing units (each capable of simultaneously processing 16 single precision (32-bit) or 8 double precision (64-bit) floating point numbers) per core. Taking advantage of these hardware and software features is the key to optimal use of the Intel Xeon Phi processor.

This document describes a way to take advantage of the new Intel AVX-512 ISA in the Intel Xeon Phi processor. An example of an image processing application will be used to show how, with Intel AVX-512, the Intel C++ Compiler now automatically vectorizes a loop that was not vectorized with Intel® Advanced Vector Extensions 2 (Intel® AVX2). We will discuss performance issues arising with this vectorized code.

The full specification of the Intel AVX-512 ISA consists of several subsets. Some of those subsets are available in the Intel Xeon Phi processor. Some subsets will also be available in future Intel® Xeon® processors. A detailed description of the Intel AVX-512 subsets and their presence in different Intel processors is described in (Zhang, 2016).

In this document, the focus will be on the subsets of the Intel AVX-512 ISA, which provides vectorization functionality present both in current Intel Xeon Phi processor and future Intel Xeon processors. These subsets include the Intel AVX-512 Foundation Instructions (Intel AVX-512F) subset (which provides core functionality to take advantage of vector instructions and the new 512-bit vector registers) and the Intel AVX-512 Conflict Detection Instructions (Intel AVX-512CD) subset (which adds instructions that detect data conflicts in vectors, allowing vectorization of certain loops with data dependences).

Vectorization Techniques

There are several ways to take advantage of vectorization capabilities on an Intel Xeon Phi processor core:

  • Use optimized/vectorized libraries, like the Intel® Math Kernel Library (Intel® MKL).
  • Write vectorizable high-level code, so the compiler will create corresponding binary code using the vector instructions available in the hardware (this is commonly called automatic vectorization).
  • Use language extensions (compiler intrinsic functions) or direct calling to vector instructions in assembly language.

Each of these methods has advantages and disadvantages, and which method to use depends on the particular case. This document focuses on writing vectorizable code, which keeps the code more portable and ready for future processors. We will explore a simple example (a histogram) for which the new Intel AVX-512 instruction set lets the compiler create executable code that runs in vector mode on the Intel Xeon Phi processor. The purpose of this example is to give insight into why, using the Intel AVX-512 ISA, the compiler can now vectorize source code containing data dependencies that was not recognized as vectorizable with previous instruction sets, like Intel AVX2. Detailed information about the Intel AVX-512 ISA can be found in (Intel, 2016).

In future documents, techniques to explicitly guide vectorization using language extensions and compiler intrinsics will be discussed. Those techniques are helpful in complex loops for which the compiler is not able to safely vectorize the code due to complex flow or data dependencies. However, the relatively simple example shown in this document is helpful in understanding how the compiler uses the new features present in the Intel AVX-512 ISA to improve the performance of some common loop structures.

Example: histogram computation in images.

To understand the new features offered by the AVX512F and AVX512CD subsets, we will use the example of computing an image histogram.

An image histogram is a graphical representation of the distribution of pixel values in an image (Wikipedia, n.d.). The pixel values can be single scalars representing grayscale values or vectors containing values representing colors, as in RGB images (where the color is represented using a combination of three values: red, green, and blue).

In this document, we used a 3024 x 4032 grayscale image. The total number of pixels in this image is 12,192,768. The original image and the corresponding histogram (computed using 1-pixel 256 grayscale intensity intervals) are shown in Figure 1.

Figure 1: Image used in this document (image credit: Alberto Villarreal), and its corresponding histogram.

A basic algorithm to compute the histogram is the following:

  1. Read image
  2. Get number of rows and columns in the image
  3. Set image array [1: rows x columns] to image pixel values
  4. Set histogram array [0: 255] to zero
  5. For every pixel in the image
    {
           histogram [ image [ pixel ] ] = histogram [ image [ pixel ] ] + 1
    }

Notice that in this basic algorithm, the image array is used as an index into the histogram array (a type conversion to an integer is assumed). This kind of indirect referencing cannot be unconditionally parallelized: neighboring pixels in the image might have the same intensity value, and if, for example, image[i] and image[i+1] are both 100, two simultaneously processed iterations would both try to increment histogram[100], producing a wrong count.

In the next sections, this algorithm will be implemented in C++, and it will be shown that the compiler, when using the AVX-512 ISA, will be able to safely vectorize this structure (although only in a partial way, with performance depending on the image data).

It should be noted that this implementation of a histogram computation is used in this document for pedagogical purposes only. It does not represent an efficient way to perform the histogram computation, for which efficient libraries are available. Our purpose is to show, using simple code, how the new Intel AVX-512 ISA adds vectorization opportunities, and to help us understand the new functionality the ISA provides.

There are other ways to implement parallelism for specific examples of histogram computations. For example in (Colfax International, 2015) the authors describe a way to automatically vectorize a similar algorithm (a binning application) by modifying the code using a strip-mining technique.

Hardware

To test our application, the following system will be used:

Processor: Intel Xeon Phi processor, model 7250 (1.40 GHz)
Number of cores: 68
Number of threads: 272

The information above can be checked in a Linux* system using the command

cat /proc/cpuinfo

Notice that when using the command shown above, the “flags” section in the output will include the “avx512f” and “avx512cd” processor flags. Those flags indicate that Intel AVX512F and Intel AVX512CD subsets are supported by this processor. Notice that the flag “avx2” is defined also, which means the Intel AVX2 ISA is also supported (although it does not take advantage of the 512-bit vector registers in this processor).
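
A quick way to list just the AVX-512 subsets the processor reports (a convenience one-liner, not required for the rest of the article) is:

grep -o 'avx512[a-z]*' /proc/cpuinfo | sort -u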

Vectorization Results Using The Intel® C++ Compiler

This section shows a basic vectorization analysis of a fragment of the histogram code. Specifically, two different loops in this code will be analyzed:

LOOP 1: A loop implementing a histogram computation only. This histogram is computed on the input image, stored in floating point single precision in array image1.

LOOP 2: A loop implementing a convolution filter followed by a histogram computation. The filter is applied to the original image in array image1 and then a new histogram is computed on the filtered image stored in array image2.

The following code section shows the two loops mentioned above (image and histogram data have been placed in aligned arrays):

// LOOP 1

#pragma vector aligned
for (position=cols; position<rows*cols-cols; position++)
{
         hist1[ int(image1[position]) ]++;
}

(…)

// LOOP 2

#pragma vector aligned
for (position=cols; position<rows*cols-cols; position++)
{
     // Apply the 3x3 stencil only to pixels that are not in the first or
     // last column of a row.
     if (position % cols != 0 && position % cols != cols - 1)
     {
          image2[position] = ( 9.0f*image1[position]
                                 - image1[position-1]
                                 - image1[position+1]
                                 - image1[position-cols-1]
                                 - image1[position-cols+1]
                                 - image1[position-cols]
                                 - image1[position+cols-1]
                                 - image1[position+cols+1]
                                 - image1[position+cols] );
     }
     // Range check before using the pixel value as an index into the histogram.
     if (image2[position] >= 0 && image2[position] <= 255)
          hist2[ int(image2[position]) ]++;
}

This code was compiled using Intel C++ Compiler’s option to generate an optimization report as follows:

icpc histogram.cpp -o histogram -O3 -qopt-report=2 -qopt-report-phase=vec -xCORE-AVX2

Note that, in this case, the -xCORE-AVX2 compiler flag has been used to ask the compiler to use the Intel AVX2 ISA to generate executable code.

 

The section of the optimization report that the compiler created for the loops shown above looks like this:

LOOP BEGIN at histogram.cpp(92,5)
   remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
   remark #15346: vector dependence: assumed FLOW dependence between  line 94 and  line 94
LOOP END

LOOP BEGIN at histogram.cpp(92,5)
<Remainder>
LOOP END

LOOP BEGIN at histogram.cpp(103,5)
   remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
   remark #15346: vector dependence: assumed FLOW dependence between  line 118 and  line 118
LOOP END

As can be seen in the section of the optimization report shown above, the compiler did not vectorize either loop, due to assumed dependences in the lines of code where the histogram computations take place (lines 94 and 118).

Now let's compile the code using the -xMIC-AVX512 flag, to instruct the compiler to use the Intel AVX-512 ISA:

icpc histogram.cpp -o histogram -O3 -qopt-report=2 -qopt-report-phase=vec -xMIC-AVX512

This creates the following output for the code segment in the optimization report, showing that both loops have now been vectorized:

LOOP BEGIN at histogram.cpp(92,5)
   remark #15300: LOOP WAS VECTORIZED

   LOOP BEGIN at histogram.cpp(94,8)
      remark #25460: No loop optimizations reported
   LOOP END
LOOP END

LOOP BEGIN at histogram.cpp(92,5)
<Remainder loop for vectorization>
   remark #15301: REMAINDER LOOP WAS VECTORIZED

   LOOP BEGIN at histogram.cpp(94,8)
      remark #25460: No loop optimizations reported
   LOOP END
LOOP END

LOOP BEGIN at histogram.cpp(103,5)
   remark #15300: LOOP WAS VECTORIZED

   LOOP BEGIN at histogram.cpp(118,8)
      remark #25460: No loop optimizations reported
   LOOP END
LOOP END

LOOP BEGIN at histogram.cpp(103,5)
<Remainder loop for vectorization>
   remark #15301: REMAINDER LOOP WAS VECTORIZED

The compiler report results can be summarized as follows:

  • LOOP 1, which implements a histogram computation, is not being vectorized using the Intel AVX2 flag because of an assumed dependency (which was described in section 3 in this document). However, the loop was vectorized when using the Intel AVX-512 flag, which means that the compiler has solved the dependency using instructions present in the Intel AVX-512 ISA.
  • LOOP 2 gets the same diagnostics as LOOP1. The difference between these two loops is that LOOP 2 adds, on top of the histogram computation, a filter operation that has no dependencies and would be vectorizable otherwise. The presence of the histogram computation is preventing the compiler from vectorizing the entire loop (when using the Intel AVX2 flag).

Note: As can be seen in the section of the optimization report shown above, the compiler split the loop into two sections: the main loop and the remainder loop. The remainder loop contains the last few iterations in the loop (those that do not completely fill the vector unit). The compiler will usually do this unless it knows in advance that the total number of iterations will be a multiple of the vector length.

We will ignore the remainder loop in this document. Ways to improve performance by eliminating the remainder loop are described in the literature.

Analyzing Performance of The Code

Performance of the above code segment was analyzed by adding timing instructions at the beginning and at the end of each one of the two loops, so that the time spent in each loop can be compared between different executables generated using different compiler options.
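
The instrumentation can be done, for instance, with std::chrono; the sketch below illustrates the idea and is not the exact code used to produce the numbers that follow:

#include <chrono>
#include <cstdio>

// Time an arbitrary piece of work (e.g., LOOP 1 or LOOP 2) in seconds.
template <typename LoopBody>
double time_loop(LoopBody body)
{
    auto t0 = std::chrono::steady_clock::now();
    body();                                // run the loop under test
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

// Usage: double s = time_loop([&]{ /* LOOP 1 */ });
//        std::printf("LOOP 1: %f s\n", s);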

The table below shows the timing results of executing, on a single core, the vectorized and non-vectorized versions of the code (results are the average of 5 executions) using the input image without preprocessing. Baseline performance is defined here as the performance of the non-vectorized code generated by the compiler when using the Intel AVX2 compiler flag.

Test case     Loop      Baseline (Intel AVX2)    Speedup Factor with Vectorization (Intel AVX-512)
Input image   LOOP 1    1                        2.2
Input image   LOOP 2    1                        7.0

To further analyze the performance of the code as a function of the input data, the input image was preprocessed using blurring and sharpening filters. Blurring filters have the effect of smoothing the image, while sharpening filters increase the contrast of the image. Blurring and sharpening filters are available in image processing or computer vision libraries. In this document, we used the OpenCV* library to preprocess the test image.

The table below shows the timing results for the three experiments:

Test case               Loop      Baseline (Intel AVX2)    Speedup Factor with Vectorization (Intel AVX-512)
Input image             LOOP 1    1                        2.2
Input image             LOOP 2    1                        7.0
Input image sharpened   LOOP 1    1                        2.6
Input image sharpened   LOOP 2    1                        7.4
Input image blurred     LOOP 1    1                        1.7
Input image blurred     LOOP 2    1                        5.6

Looking at the results above, three questions arise:

  1. Why does the compiler vectorize the code when using the Intel AVX-512 flag, but not when using the Intel AVX2 flag?
  2. If the code in LOOP 1 using the Intel AVX-512 ISA is indeed vectorized, why is the improvement in performance relatively small compared to the theoretical speedup when using 512-bit vectors?
  3. Why does the performance gain of the vectorized code change when the image is preprocessed? Specifically, why does the performance of the vectorized code increase when using a sharpened image, while it decreases when using a blurred image?

In the next section, the above questions will be answered based on a discussion about one of the subsets of the Intel AVX512 ISA, the Intel AVX512CD (conflict detection) subset.

The Intel AVX-512CD Subset

The Intel AVX-512CD (Conflict Detection) subset of the Intel AVX512 ISA adds functionality to detect data conflicts in the vector registers. In other words, it provides functionality to detect which elements in a vector operand are identical. The result of this detection is stored in mask vectors, which are used in the vector computations, so that the histogram operation (updating the histogram array) will be performed only on elements of the array (which represent pixel values in the image) that are different.

To further explore how the new instructions from the Intel AVX-512CD subset work, it is possible to ask the compiler to generate an assembly code file by using the Intel C++ Compiler -S option:

icpc example2.cpp -o example2.s -O3 -xMIC-AVX512 -S …

The above command will create, instead of the executable file, a text file containing the assembly code for our C++ source code. Let’s take a look at part of the section of the assembly code that implements line 94 (the histogram update) in LOOP 1 in the example source code:

vcvttps2dq (%r9,%rax,4), %zmm5                        #94.19 c1
vpxord    %zmm2, %zmm2, %zmm2                         #94.8 c1
kmovw     %k1, %k2                                    #94.8 c1
vpconflictd %zmm5, %zmm3                              #94.8 c3
vpgatherdd (%r12,%zmm5,4), %zmm2{%k2}                 #94.8 c3
vptestmd  %zmm0, %zmm3, %k0                           #94.8 c5
kmovw     %k0, %r10d                                  #94.8 c9 stall 1
vpaddd    %zmm1, %zmm2, %zmm4                         #94.8 c9
testl     %r10d, %r10d                                #94.8 c11
je        ..B1.165      # Prob 30%                    #94.8 c13

In the above code fragment, vpconflictd detects conflicts in the source vector register (containing the pixel values) by comparing elements with each other in the vector, and writes the results of the comparison as a bit vector to the destination. This result is further tested to determine which elements in the vector register will be used simultaneously for the histogram update, using a mask vector. (The vpconflictd instruction is part of the Intel AVX-512CD subset, and the vptestmd instruction is part of the Intel AVX-512F subset; specific information about these subsets can be found in the Intel AVX-512 ISA documentation (Intel, 2016).) This process is illustrated in Figures 2 and 3.

Figure 2: Pixel values in array (smooth image).

Figure 3: Pixel values in array (sharp image).

Figure 2 shows the case where some neighboring pixels in the image have the same value. Only the elements in the vector register that have different values in the array image1 will be used to simultaneously update the histogram. In other words, only the elements that will not create a conflict will be used to simultaneously update the histogram. The elements in conflict will still be used to update the histogram, but at a different time.

In this case, the performance will vary depending on how smooth the image is. The worst-case scenario is when all the elements in the vector register are identical, which decreases performance considerably, not only because the loop is then effectively processed in scalar mode, but also because of the overhead introduced by the conflict detection and testing instructions.

Figure 3 shows the case where the image was sharpened. In this case it is more likely that neighboring pixels in the vector register will have different values. Most or all of the elements in the vector register will be used to update the histogram, thereby increasing the performance of the loop because more elements will be processed simultaneously in the vector register.

It is clear that the best performance will be obtained when all elements in the array are different. However, even then the performance will be less than the theoretical speedup (16x in this case, since a 512-bit register holds sixteen 32-bit elements), because of the overhead introduced by the conflict detection and testing instructions.

The above discussion answers the questions raised in section 5.

Regarding the first question, why the compiler generates vectorized code when using the Intel AVX-512 flag: the Intel AVX-512CD and Intel AVX-512F subsets include new instructions that detect conflicts among the elements processed in each iteration and build conflict-free subsets of elements that can be safely vectorized. The size of these subsets is data dependent. Vectorization was not possible with the Intel AVX2 flag because the Intel AVX2 ISA includes no conflict detection functionality.

The second question, about the reduced performance (compared to the theoretical speedup) of the vectorized code, can be answered by considering the overhead introduced when the conflict detection and testing instructions are executed. This performance penalty is most noticeable in LOOP 1, where the only computation in the loop is the histogram update.

However, in LOOP 2, where extra work is performed on top of the histogram update, the performance gain relative to the baseline increases. With the Intel AVX-512 flag, the compiler resolves the dependency created by the histogram computation, increasing the total performance of the loop. In the Intel AVX2 case, the dependency in the histogram computation prevents the other computations in the loop (even the dependency-free ones) from running in vector mode. This is an important consequence of the Intel AVX-512CD subset: the compiler can now generate vectorized code for more complex loops containing histogram-like dependencies, which before Intel AVX-512 would likely have required rewriting in order to be vectorized.
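
Again with hypothetical names, a LOOP 2-style body mixes the conflict-prone update with independent arithmetic, which is the situation where resolving the dependency pays off most:

// LOOP 2-style loop (hypothetical names and extra work): the histogram
// update coexists with dependency-free arithmetic on out[]. With the
// AVX2 flag the hist[v]++ dependency blocks vectorization of the whole
// body; with AVX-512CD the compiler can vectorize both parts.
void histogram_loop2(const float* image1, float* out, int n, int* hist) {
    for (int i = 0; i < n; ++i) {
        int v = (int)image1[i];
        hist[v]++;                          // conflict-prone update
        out[i] = image1[i] * 0.5f + 1.0f;   // independent, conflict-free work
    }
}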

Regarding the third question, note that the total performance of the vectorized loops becomes data dependent when the conflict detection mechanism is used. As shown in Figures 2 and 3, the speedup in vector mode depends on how many values in the vector register are conflict-free (not identical). Sharp or noisy images (in this case) are less likely to have similar or identical values in neighboring pixels, compared to a smooth or blurred image.

Conclusions

This article showed a simple example of a loop which, because of possible memory conflicts, was not vectorized by the Intel C++ compiler with the Intel AVX2 (and earlier) instruction sets, but which is vectorized when using the Intel AVX-512 ISA on an Intel Xeon Phi processor. In particular, the new functionality in the Intel AVX-512CD and Intel AVX-512F subsets (currently available in the Intel Xeon Phi processor and in future Intel Xeon processors) lets the compiler automatically generate vector code for this kind of application, with no changes to the code. However, the performance of vector code created this way will in general be lower than that of an application running in full vector mode, and will also be data dependent, because the compiler vectorizes the application using mask registers whose contents vary depending on how similar neighboring data is.

The intent of this document is to motivate the use of the new functionality in the Intel AVX-512CD and Intel AVX-512F subsets. In future documents, we will explore more possibilities for vectorization of complex loops by taking explicit control of the logic that updates the mask vectors, with the purpose of increasing the efficiency of the vectorization.
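
As a rough preview of what such explicit control might look like, the following sketch updates a histogram using Intel AVX-512 intrinsics, committing conflict-free lanes in rounds until every lane (duplicates included) has been counted. It is a minimal illustration under simplifying assumptions (the element count is a multiple of 16, bins are int, and all names are hypothetical), not a tuned implementation:

#include <immintrin.h>

// Histogram update with explicit conflict handling (sketch).
// Assumes n is a multiple of 16 and every idx[i] is a valid bin index.
void histogram_avx512cd(const int* idx, int n, int* hist) {
    const __m512i one = _mm512_set1_epi32(1);
    for (int i = 0; i < n; i += 16) {
        __m512i v    = _mm512_loadu_si512(&idx[i]);
        __m512i conf = _mm512_conflict_epi32(v);   // vpconflictd
        __mmask16 remaining = 0xFFFF;              // lanes not yet committed
        while (remaining) {
            // A lane is ready when none of its conflict bits point to a
            // lane that is still pending (vptestnmd).
            __m512i pending = _mm512_set1_epi32((int)remaining);
            __mmask16 ready =
                _mm512_mask_testn_epi32_mask(remaining, conf, pending);
            // Gather current counts, add 1, scatter back; ready lanes only.
            __m512i h = _mm512_mask_i32gather_epi32(
                _mm512_setzero_si512(), ready, v, hist, 4);
            h = _mm512_add_epi32(h, one);
            _mm512_mask_i32scatter_epi32(hist, ready, v, h, 4);
            remaining &= ~ready;  // duplicates retry in the next round
        }
    }
}

Built with the same -xMIC-AVX512 option used above, each round of the inner loop commits one "layer" of duplicates, so a register full of identical values degenerates to 16 rounds, mirroring the worst case discussed earlier.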

References

Colfax International. (2015). "Optimization Techniques for the Intel MIC Architecture. Part 2 of 3: Strip-Mining for Vectorization." Retrieved from http://colfaxresearch.com/optimization-techniques-for-the-intel-mic-architecture-part-2-of-3-strip-mining-for-vectorization/

Intel. (2016, February). "Intel® Architecture Instruction Set Extensions Programming Reference." Retrieved from https://software.intel.com/sites/default/files/managed/b4/3a/319433-024.pdf

Wikipedia. (n.d.). "Image histogram." Retrieved from https://en.wikipedia.org/wiki/Image_histogram

Zhang, B. (2016). "Guide to Automatic Vectorization With Intel AVX-512 Instructions in Knights Landing Processors."

Go-To-Market Strategies for Your Small Business B2B App


In previous articles, we’ve discussed go-to-market strategies for selling your app to consumers, but as you consider the B2B market, how would your go-to-market plan need to differ? Since B2B means Business to Business, the key difference is that you’re selling your app directly to a business, or a person representing a business, rather than selling it to a person who only represents themselves. That’s going to change the way your customer makes decisions, and how you reach them. In this article, we’ll look specifically at go-to-market strategies for B2B apps.

 

Imagine that you’ve created an app to help dentists with scheduling reminders. It automatically generates reminder emails and texts, enables quick patient confirmation, and even allows the office to include additional promotional materials as needed. You can’t just post your app somewhere and hope they’ll find it and download it, and you can’t just market it like you would a game or a utility app—you’ll need to reach out to dental practices, in the places where they’re most likely to listen, and demonstrate that your product can help them to run their business better.

Know Your Customer—Their Responsibilities and Their Journey

In this example, your product is very specific to dental offices, so your primary target will likely be the dentist or an office manager working closely with the dentist. The general principles involved in knowing your customer—defining your audience, picking your channels, and customer acquisition—are mostly the same as they would be with a consumer app, but your B2B customer is more complicated because they have to represent their company, and the company’s interests, beyond their own. They also have a different journey than an individual consumer would have—with more external considerations and more focus on hard numbers. With a consumer app, you may just need to pique someone’s interest in a fun-sounding game, but with a B2B app, you have to understand how the app improves their bottom line or fits into their business plan.

Here are some questions to answer about your target customer:

  • Know the industry - 
     
    • Are there particular times of the year that will affect their interest or ability to implement new software?
       
    • Are there industry-specific processes that you should know in order to address their needs?
       
    • Are there conferences or regular industry events that would be a good place to introduce your product?
       
    • How do they usually make decisions about this aspect of the business (for example, patient communication or scheduling)?
       
    • Are there any relevant service providers that might be interested in distributing your app?
  • Know the benefits - 
     
    • What pain points does this business have?
       
    • How can your app address those pain points?
       
    • How will your app help them increase sales/reduce cost/improve retention?

ROI Is King

One key thing to remember is that business consumers are extremely interested in the return on investment, or ROI. Your app needs to solve a pain point in order to be worth their time and money, and you’ll need to be able to communicate that clearly to the business. For example, the appointment reminder app could cut down on potentially lost revenue due to missed appointments, while also freeing up the office manager to work on other aspects of the business.

Relationships are Key

B2B apps tend to use a subscription model in which customers pay a monthly or annual fee to use your app within their business. This is great for your bottom line—but because the cost is higher, and because integrating your tool will likely result in procedural changes within the business, the sales effort is also likely to be longer. All this is to say, relationships are a really important part of marketing and selling B2B apps. Business customers expect there to be ongoing support and communication, and you simply have to be able to talk to people and maintain long-term relationships for this business model to work. If the dental office signs up for a one-year contract, you might plan on quarterly updates, and be available to hear feedback and provide support.

Where Can You Find Them?

Finding the audience for a B2B app will really depend on the particular market or business you’re trying to serve, but here are a few ideas to get you started:

  • Industry events/continuing education. It’s a good idea to be wherever members of your target audience will be, like the annual ADA convention, and it’s even better if you can find events directly tied to the pain points your product addresses, like office systems management courses geared toward dental offices. Consider a table or a presentation at a conference, a banner ad on an online course, or ad space in an industry catalog.
     
  • Technology service providers/resellers. Some small businesses would prefer to hire a technology service provider to make sure all of their systems are working and up to date—and your app might be something they can include in their offering. A service provider who works with multiple dental offices would be able to sell and distribute your app to multiple customers at once.
     
  • PR. Pitch your business story to industry publications. If you’re able to get a write-up in a leading dental industry magazine, you’ll build name recognition and interest.
     
  • Videos and content. Create materials they can view on their own, and then contact you if they’re interested. Remember, they’re running a business and they're busy, so you want to make it as easy as possible for them to learn about your product.
     
  • Meetups and seminars for industry/new business owners. Beyond big industry events, look for local meet-ups and seminars for new business owners. Your local BBB or Chamber of Commerce can also be a great resource.

The Importance of Word of Mouth

We’ve already discussed the importance of relationships, but with B2B it’s also important to remember another kind of relationship: the one your clients have with one another. Word of mouth is essential, and people within your target industry very likely rely on and trust each other for recommendations (and warnings) about products and apps on the market. You might want to give away samples or trials in order to get your app out there and earn good reviews. Start with a few dentists who might want to be early adopters, and offer them incentives for trying your product and for spreading the word. You might even want to offer a specific referral program, where they can get a month free or a discounted premium service. Reassurance from peers is important in every industry, and once people start talking about your app, it’s no longer unknown, and business owners will be more likely to try it.

Businesses are always looking to improve their efficiency and performance, so when you're marketing to a business—particularly a small business—make sure you keep those end goals in mind. How can your app solve their pain points? What will the benefits be? The increased vetting and focus on ROI might seem like a lot at first, when you're used to working on consumer apps, but building long-term relationships with targeted customers and developing high-value apps that really meet their needs can be a very satisfying path to success. 
