
Intel® Advisor 2017 Update 4: What’s New


We’re pleased to announce a new version of the Vectorization Assistant tool: Intel® Advisor 2017 Update 4. For details about download, terms and conditions, please refer to the Intel® Parallel Studio 2017 program site.

Below are highlights of the new functionality in Intel® Advisor 2017 Update 4:

Bug fixes

Get Intel® Advisor and more information

Visit the product site, where you can find videos and tutorials. Register for Intel® Parallel Studio XE 2017 to download the whole bundle, including Intel® Advisor 2017 update 4.


How to Use Cache Monitoring Technology in OpenStack*


Introduction

With an increasing number of workloads running simultaneously on a system, there is more pressure on shared resources such as the CPU, cache, network bandwidth, and memory. This pressure reduces workload performance and, if one or more of the workloads is bursty in nature, it also reduces performance determinism. An interfering workload is called a noisy neighbor, and for the purposes of this discussion a workload could be any software application, a container, or even a virtual machine (VM).

Intel® Resource Director Technology (Intel® RDT) provides hardware support to monitor and manage shared resources, such as the last level cache (LLC) (also called the L3 cache), and memory bandwidth. In conjunction with software support, starting with the operating system and going up the solution stack, this functionality is being made available to monitor and manage shared resources to isolate workloads and improve determinism. In particular, the cache monitoring technology (CMT) aspect of Intel RDT provides last-level cache usage information for a workload.

OpenStack* is an open source cloud operating system that controls datacenter resources, namely compute, storage, and networking. Users and administrators can access the resources through a web interface or RESTful API calls. For the purposes of this document, we assume that the reader has some knowledge of OpenStack, either as an operator/deployer, or as a developer.

Let us explore how to enable and use CMT, in the context of an OpenStack cloud, to detect cache-related workload interference and take remedial action(s).

Note 1: Readers of this article should have a basic understanding of OpenStack and its deployment and configuration.

Note 2: All of the configurations and examples are based on the OpenStack Newton* release version (released in October 2016) and the Gnocchi* v3.0 release.

Enabling CMT in OpenStack*

Leveraging CMT in OpenStack requires touching the Nova*, Ceilometer*, and optionally the Gnocchi and Aodh* projects. The Nova project concerns itself with scheduling and managing workloads on the compute hosts. Ceilometer and Gnocchi pertain to telemetry. The Ceilometer agent runs on the compute hosts, gathers configured items of telemetry, and pushes them out for storage and future retrieval. The actual telemetry data can be saved either in Ceilometer’s own database or in the Gnocchi time series database with indices. The latter is superior in both storage efficiency and retrieval speed. OpenStack Aodh supports defining rule-action pairs, such as emitting an alarm when a telemetry value crosses a threshold. Alarms in turn could trigger some kind of operator intervention.

Enabling CMT in Nova*

OpenStack Nova provides access to the compute resources via a RESTful API and a web dashboard. To enable the CMT feature in Nova, the following preconditions have to be met:

  • The compute node hardware must support the CMT feature. CPUs that support CMT include (but are not limited to) the Intel® Xeon® processor E5 v3 and E5 v4 families. Please verify in the CPU specification that your processor supports CMT.
  • The libvirt version installed on the Nova compute nodes must be 2.0.0 or greater.
  • The hypervisor running on the Nova compute host must be a kernel-based virtual machine (KVM).

If all of the above preconditions are satisfied, and Nova is currently running, edit the libvirt section of the Nova configuration file (by default it is /etc/nova/nova.conf):

[libvirt]
virt_type = kvm
enabled_perf_events = cmt

After saving the above modifications, restart the Nova compute service (openstack-nova-compute), which runs on each compute host.

On Ubuntu* and CentOS* 6.5 hosts, run the following commands to restart the Nova compute service:

# service openstack-nova-compute restart
# service openstack-nova-compute status

On CentOS 7 and Fedora* 20 hosts, run the following commands instead to restart the Nova compute service:

# systemctl restart openstack-nova-compute
# systemctl status openstack-nova-compute

Once Nova is restarted, any new VMs launched by Nova will have the CMT feature enabled.
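As an optional sanity check, you can confirm that a newly launched guest has the CMT perf event enabled in its libvirt domain XML. Below is a minimal Python sketch; the helper function and the example domain name are ours, and it assumes virsh is available on the compute host and that libvirt records the event as <perf><event name='cmt' enabled='yes'/>:

# Hypothetical check on the compute host: parse "virsh dumpxml <domain>" and
# look for the cmt perf event that enabled_perf_events adds to the guest XML.
import subprocess
import xml.etree.ElementTree as ET

def cmt_enabled(domain_name):
    xml = subprocess.check_output(["virsh", "dumpxml", domain_name]).decode()
    root = ET.fromstring(xml)
    for event in root.findall("./perf/event"):
        if event.get("name") == "cmt":
            return event.get("enabled") == "yes"
    return False

print(cmt_enabled("instance-00000001"))  # domain name is an example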

If devstack is being used instead to install a fresh OpenStack environment, add the following to the devstack local.conf file:

[[post-config|$NOVA_CONF]]
[libvirt]
virt_type = kvm
enabled_perf_events = cmt, mbml, mbmt

After saving the above configuration, run devstack to start the installation.

Enabling CMT in Ceilometer*

Ceilometer is part of the OpenStack Telemetry project whose mission is to:

  • Reliably collect utilization data from each host and for the VMs running on those hosts.
  • Persist the data for subsequent retrieval and analysis.
  • Trigger actions when defined criteria are met.

To get the last-level cache usage of a running VM, Ceilometer must be installed, configured to collect the cpu_l3_cache metric, and be running. Ceilometer defaults to collecting the metric. The cpu_l3_cache metric is collected by the Ceilometer agent running on the compute host by periodically polling for VM utilization metrics on the host.

If devstack is being used to install Ceilometer along with other OpenStack services and components, add the following in the devstack local.conf file:

[[local|localrc]]
enable_plugin ceilometer git://git.openstack.org/openstack/ceilometer
enable_plugin aodh git://git.openstack.org/openstack/aodh

After saving the above configuration, run devstack to start the installation. This will install Ceilometer as well as Aodh (OpenStack alarming service) in addition to other OpenStack services and components.

Storing the CMT Metrics

There are two options for saving telemetry data: in Ceilometer’s own backend database or in Gnocchi’s database (Gnocchi is also a member of the OpenStack Telemetry project). Gnocchi provides a time-series database with a resource indexing service, which is vastly superior to Ceilometer’s native storage in performance at scale, disk utilization, and data-retrieval speed. We recommend installing Gnocchi and configuring Ceilometer to store metrics in it. To do so using devstack, modify the devstack local.conf file as follows:

[[local|localrc]]
enable_plugin ceilometer git://git.openstack.org/openstack/ceilometer
CEILOMETER_BACKEND=gnocchi
enable_plugin aodh git://git.openstack.org/openstack/aodh
enable_plugin gnocchi git://git.openstack.org/openstack/gnocchi

After saving the above configuration, run devstack to start the installation.

Refer to Gnocchi documentation for information on other Gnocchi installation methods.

After installing Gnocchi and Ceilometer, confirm that the following configuration settings are in place:

In the Ceilometer configuration file (by default it is /etc/ceilometer/ceilometer.conf), make sure the options are listed as follows:

[DEFAULT]
meter_dispatchers = gnocchi
[dispatcher_gnocchi]
filter_service_activity = False
archive_policy = low
url = <url to the Gnocchi API endpoint>

In the Gnocchi dispatcher configuration file (by default it is /etc/ceilometer/gnocchi_resources.yaml), make sure that the cpu_l3_cache metric is added into the resource type instance’s metrics list:

… …
  - resource_type: instance
    metrics:
      - 'instance'
      - 'memory'
      - 'memory.usage'
      - 'memory.resident'
      - 'vcpus'
      - 'cpu'
      - 'cpu_l3_cache'
… …

If any modifications are made to the above configuration files, you must restart the Ceilometer collector so that the new configurations take effect.

Verify Things are Working

To verify that all of the above are working, test as follows:

  1. Create a new VM.

    $ openstack server create --flavor m1.tiny --image cirros-0.3.4-x86_64-uec abc

  2. Confirm that the VM has been created successfully.

    $ openstack server list

    +--------------------------------------+------+--------+------------------+-------------------------+
    | ID                                   | Name | Status | Networks         | Image Name              |
    +--------------------------------------+------+--------+------------------+-------------------------+
    | 7e38a89b-c829-4fb9-b44a-35090fbc0866 | abc  | ACTIVE | private=10.0.0.3 | cirros-0.3.4-x86_64-uec |
    +--------------------------------------+------+--------+------------------+-------------------------+

  3. Wait for some time to allow the Ceilometer agent to collect the cpu_l3_cache metrics. The wait time is determined by the related pipeline defined in the /etc/ceilometer/pipeline.yaml file.
  4. Check to see if the related metrics are collected and stored.
    1. If the metric is stored in Ceilometer’s own database backend, use the ceilometer sample-list command, for example:

      $ ceilometer sample-list --meter cpu_l3_cache

      +--------------------------------------+--------------------------------------+--------------+-------+----------+------+----------------------------+
      | ID                                   | Resource ID                          | Name         | Type  | Volume   | Unit | Timestamp                  |
      +--------------------------------------+--------------------------------------+--------------+-------+----------+------+----------------------------+
      | f42e275a-b36a-11e6-96b2-525400e9f0eb | 7e38a89b-c829-4fb9-b44a-35090fbc0866 | cpu_l3_cache | gauge | 270336.0 | B    | 2016-12-08T23:57:37.535615 |
      | 8e872286-b369-11e6-96b2-525400e9f0eb | 7e38a89b-c829-4fb9-b44a-35090fbc0866 | cpu_l3_cache | gauge | 450560.0 | B    | 2016-12-08T23:47:37.505369 |
      | 28e57758-b368-11e6-96b2-525400e9f0eb | 7e38a89b-c829-4fb9-b44a-35090fbc0866 | cpu_l3_cache | gauge | 270336.0 | B    | 2016-12-08T23:37:37.536424 |
      | ...                                  | ...                                  | ...          | ...   | ...      | ...  | ...                        |
      +--------------------------------------+--------------------------------------+--------------+-------+----------+------+----------------------------+

    2. However, if the metric is stored in Gnocchi, access it as follows:

      $ gnocchi measures show --resource-id 9184470a-594e-4a46-a124-fa3aaaf412dc cpu_l3_cache --aggregation mean

      +---------------------------+-------------+---------------+
      | Timestamp                 | Granularity | Value         |
      +---------------------------+-------------+---------------+
      | 2016-12-09T00:00:00+00:00 | 86400.0     | 282350.933333 |
      | 2016-12-09T01:00:00+00:00 | 3600.0      | 216268.8      |
      | 2016-12-09T01:45:00+00:00 | 300.0       | 180224.0      |
      | 2016-12-09T01:55:00+00:00 | 300.0       | 180224.0      |
      | ...                       | ...         | ...           |
      +---------------------------+-------------+---------------+

Using CMT in OpenStack

A noisy neighbor in the OpenStack environment could be a VM consuming resources in a manner that adversely affects one or more other VMs on the same compute node. Whether because of a lack of knowledge of workload characteristics, missing information during Nova scheduling, or a change in workload characteristics (a spike in usage, a virus, or similar), a noisy situation may occur on a host. The cloud admin might want to detect it and take some action, such as live migrating the greedy workload or terminating it. The OpenStack Aodh project enables detecting such scenarios and alerting to their existence using condition-action pairs. An Aodh rule that monitors VM cache usage crossing some threshold automates detection of noisy neighbor scenarios.

Below, we illustrate setting up an Aodh rule to detect noisy neighbors. The actual rule depends upon where the CMT telemetry data is stored. We first cover storage in the Ceilometer database and then in the Gnocchi time series database.

Metrics Stored in Ceilometer Database

Below, we define, using the Aodh command-line utility, a threshold CMT metrics rule:

$ aodh --debug alarm create --name cpu_l3_cache -t threshold --alarm-action "log://" --repeat-actions True --comparison-operator "gt" --threshold 180224 --meter-name cpu_l3_cache --period 600 --statistic avg

+---------------------------+-----------------------------------------------------------------+
| Field                     | Value                                                           |
+---------------------------+-----------------------------------------------------------------+
| alarm_actions             | [u'log://']                                                     |
| alarm_id                  | e3673d39-90ed-4455-80f1-fd7e06e1f2b8                            |
| comparison_operator       | gt                                                              |
| description               | Alarm when cpu_l3_cache is gt a avg of 180224 over 600 seconds  |
| enabled                   | True                                                            |
| evaluation_periods        | 1                                                               |
| exclude_outliers          | False                                                           |
| insufficient_data_actions | []                                                              |
| meter_name                | cpu_l3_cache                                                    |
| name                      | cpu_l3_cache                                                    |
| ok_actions                | []                                                              |
| period                    | 600                                                             |
| project_id                | f1730972dd484b94b3b943d93f3ee856                                |
| repeat_actions            | True                                                            |
| query                     |                                                                 |
| severity                  | low                                                             |
| state                     | insufficient data                                               |
| state_timestamp           | 2016-12-08T23:59:05.712994                                      |
| statistic                 | avg                                                             |
| threshold                 | 180224                                                          |
| time_constraints          | []                                                              |
| timestamp                 | 2016-12-08T23:59:05.712994                                      |
| type                      | threshold                                                       |
| user_id                   | cfcd1ea48a1046b192dbd3f5af11290e                                |
+---------------------------+-----------------------------------------------------------------+

This creates an alarm rule named cpu_l3_cache that is triggered if, and only if, within a sliding window of 10 minutes (600 seconds), the VM’s average cpu_l3_cache metric is greater than 180224. If the alarm is triggered, it will be logged in the Aodh alarm notifier agent’s log. Alternately, instead of just logging the alarm event, a notifier may be used to push a notification to one or more configured endpoints. For example, we could use the http notifier by providing "http://<endpoint ip>:<endpoint port>" as the alarm-action parameter.
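For illustration, the following is a minimal sketch of an HTTP endpoint that could receive such alarm notifications. It uses only the Python standard library; the port, URL path, and payload handling are our assumptions, not something mandated by Aodh:

# Minimal sketch of an endpoint for Aodh's http alarm-action. Aodh POSTs a
# JSON body describing the alarm state transition; here we just log it.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class AlarmHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length).decode("utf-8", errors="replace")
        try:
            print("alarm notification:", json.loads(body))
        except ValueError:
            print("alarm notification (raw):", body)
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    # Then create the alarm with --alarm-action "http://<this host>:8080/"
    HTTPServer(("0.0.0.0", 8080), AlarmHandler).serve_forever()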

Metrics Stored in Gnocchi*

If the metrics are stored in Gnocchi, an Aodh alarm could be created through a gnocchi_resources_threshold rule such as the following, using the Aodh command-line utility:

$ aodh --debug alarm create -t gnocchi_resources_threshold --name test1 --alarm-action "log://alarm" --repeat-actions True --metric cpu_l3_cache --threshold 100000 --resource-id 9184470a-594e-4a46-a124-fa3aaaf412dc --aggregation-method mean --resource-type instance --granularity 300 --comparison-operator 'gt'

+---------------------------+---------------------------------------+
| Field                     | Value                                 |
+---------------------------+---------------------------------------+
| aggregation_method        | mean                                  |
| alarm_actions             | [u'log://alarm']                      |
| alarm_id                  | 71f48ee1-b92f-4982-92e4-4c520649a8e0  |
| comparison_operator       | gt                                    |
| description               | gnocchi_resources_threshold alarm rule|
| enabled                   | True                                  |
| evaluation_periods        | 1                                     |
| granularity               | 300                                   |
| insufficient_data_actions | []                                    |
| metric                    | cpu_l3_cache                          |
| name                      | test1                                 |
| ok_actions                | []                                    |
| period                    | 600                                   |
| project_id                | 543aa2e8e17449149d5c101c55675005      |
| repeat_actions            | True                                  |
| resource_id               | 9184470a-594e-4a46-a124-fa3aaaf412dc  |
| resource_type             | instance                              |
| state                     | insufficient data                     |
| state_timestamp           | 2016-12-09T05:57:07.089530            |
| threshold                 | 100000                                |
| time_constraints          | []                                    |
| timestamp                 | 2016-12-09T05:57:07.089530            |
| type                      | gnocchi_resources_threshold           |
| user_id                   | ca859810b379425085756faf6fd04ded      |
+---------------------------+---------------------------------------+

This creates an alarm named test1 that is triggered if, and only if, over a granularity period of 300 seconds, the VM 9184470a-594e-4a46-a124-fa3aaaf412dc registers an average (mean) cpu_l3_cache metric greater than 100000. If triggered, an alarm is logged to the Aodh alarm notifier agent’s log output. Instead of the command-line utility, the Aodh RESTful API could be used to define alarms; refer to http://docs.openstack.org/developer/aodh/webapi/v2.html for details.

While Gnocchi v3.0 has limited resource-querying capabilities when it comes to filtering on metric values and thresholds, such enhancements are expected in future releases.
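In the meantime, measures stored in Gnocchi can also be polled directly. The following is a minimal sketch, assuming the standard Gnocchi v3 REST layout and a valid Keystone token; the endpoint, token, resource ID, and threshold values are placeholders for illustration, not part of any OpenStack configuration:

# Sketch only: poll Gnocchi for a VM's cpu_l3_cache measures and flag it as a
# potential noisy neighbor. All values below are example placeholders.
import requests

GNOCCHI = "http://controller:8041"            # Gnocchi API endpoint (example)
TOKEN = "<keystone token>"                    # obtain via Keystone
RESOURCE_ID = "9184470a-594e-4a46-a124-fa3aaaf412dc"
THRESHOLD = 180224                            # bytes of LLC occupancy

url = "{}/v1/resource/instance/{}/metric/cpu_l3_cache/measures".format(
    GNOCCHI, RESOURCE_ID)
resp = requests.get(url, headers={"X-Auth-Token": TOKEN},
                    params={"aggregation": "mean"})
resp.raise_for_status()

# Each measure is [timestamp, granularity, value]
for timestamp, granularity, value in resp.json()[-5:]:
    if value > THRESHOLD:
        print("possible noisy neighbor at {} (granularity {}s): {} B"
              .format(timestamp, granularity, value))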

More About Intel® Resource Director Technology (Intel® RDT)

The Intel RDT family comprises, beyond CMT, other monitoring and resource allocation technologies. Those that will soon be available are:

  • Cache Allocation Technology (CAT) enables allocation of cache to workloads, either in exclusive or shared mode, to ensure performance despite co-resident (running on the same host) workloads. For instance, more cache can be allocated to a high-priority task that has a larger working set or, conversely, cache usage can be restricted for a lower-priority streaming application so that it does not interfere with higher priority tasks.
  • Memory Bandwidth Monitoring (MBM), along the lines of CMT, provides memory usage information for workloads.
  • Code Data Prioritization (CDP) enables separate control over code and data placement in the last-level cache.

To learn more visit http://www.intel.com/content/www/us/en/architecture-and-technology/resource-director-technology.html.

In conclusion, we hope the above provides you with adequate information to start using CMT in an OpenStack cloud to gain deeper insights into workload characteristics to positively influence performance.

Fast Computation of Adler32 Checksums


Abstract

Adler32 is a common checksum used for checking the integrity of data in applications such as zlib*, a popular compression library. It is designed to be fast to execute in software, but in this paper we present a method to compute it with significantly better performance than the previous implementations. We show how the vector processing capabilities of Intel® Architecture Processors can be exploited to efficiently compute the Adler32 checksum.

Introduction

The Adler32 checksum (https://en.wikipedia.org/wiki/Adler-32) is similar to the Fletcher checksum, but it is designed to catch certain differences that Fletcher is not able to catch. It is used, among other places, in the zlib data compression library (https://en.wikipedia.org/wiki/Zlib), a popular general-purpose compression library.

While scalar implementations of Adler32 can achieve reasonable performance, this paper presents a way to further improve the performance by using the vector processing feature of Intel processors. This is an extension of the method we used to speed up the Fletcher checksum as described in (https://software.intel.com/en-us/articles/fast-computation-of-fletcher-checksums).

Implementation

If the input stream is considered to be an array of bytes (data), the checksum essentially consists of two 16-bit words (A and B) and can be defined as:

for (i=0; i<end; i++) {
     A = (A + data[i]) % 65521;
     B = (B + A) % 65521;
}

Doing the modulo operation after every addition is expensive. A well-known way to speed this up is to do the addition using larger variables (for example, 32-bit dwords), and then to perform the modulo only when the variables are at risk of overflowing, for example:

for (i=0; i<5552; i++) {
     A = (A + data[i]);
     B = (B + A);
}
A = A % 65521;
B = B % 65521;

The reason that up to 5552 bytes can be processed before needing to do the modulo is that if A and B are initially 65520 and the data is all 0xFF (255), after processing 5552 bytes, B (the larger of the two) will be 0xFFFBC598. But if one processes 5553 such bytes, the result would exceed 2^32, overflowing a 32-bit dword.
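To make the bound concrete, the worst case can be checked with a few lines of Python (a verification sketch, not part of the optimized implementation):

# Worst case for deferring the modulo: A and B start at 65520 (the largest
# value below the modulus 65521) and every input byte is 0xFF.
def worst_case_b(n_bytes):
    a = b = 65521 - 1                  # 65520
    for _ in range(n_bytes):
        a += 0xFF
        b += a
    return b

print(hex(worst_case_b(5552)))         # 0xfffbc598 -> still fits in 32 bits
print(worst_case_b(5553) < 2**32)      # False -> 5553 bytes would overflow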

Within that loop, the calculation looks the same as in Fletcher, so the same approach can be used to vectorize the calculation. In this case, the body of the main loop would be an unrolled version of:

     pmovzxbd xdata0, [data]    ; Loads byte data into dword lanes
     paddd xa, xdata0
     paddd xb, xa

One can see that this looks essentially identical to what one would do with scalar code, except that it is operating on vector registers and, depending on the hardware generation, could be processing 4, 8, or 16 bytes in parallel.

If “a[i]” represents the i’th lane of vector register “a” and N is the number of lanes, we can (as shown in the earlier paper) calculate the actual sums by:

     A = a[0] + a[1] + ... + a[N-1]
     B = N * (b[0] + b[1] + ... + b[N-1]) - (0*a[0] + 1*a[1] + ... + (N-1)*a[N-1])

The sums can be done using a series of horizontal adds (PHADDD), and the scaling can be done with PMULLD.
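The reduction can also be checked numerically. Below is a small numpy sketch (our own illustration, not the assembly from the paper) confirming that the per-lane sums collapse to the scalar Adler sums via the relation above; for simplicity A and B start at zero, whereas real Adler32 starts A at 1:

import numpy as np

MOD = 65521
data = np.frombuffer(b"The quick brown fox jumps over the lazy dog!!!!!", dtype=np.uint8)
N = 8                                    # number of vector lanes
K = len(data) // N                       # iterations (assumes len(data) % N == 0)

# Scalar reference with deferred modulo
A = B = 0
for byte in data:
    A += int(byte)
    B += A

# "Vector" version: lane i sees bytes i, i+N, i+2N, ...
lanes = data.reshape(K, N).astype(np.uint64)
a = lanes.sum(axis=0)                    # per-lane byte sums
b = lanes.cumsum(axis=0).sum(axis=0)     # per-lane running-sum accumulators

A_vec = int(a.sum())
B_vec = N * int(b.sum()) - int((np.arange(N, dtype=np.uint64) * a).sum())

assert (A % MOD, B % MOD) == (A_vec % MOD, B_vec % MOD)
print(A_vec % MOD, B_vec % MOD)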

In pseudo-code, if the main loop is operating on eight lanes (either with eight lanes in one register or four lanes unrolled by a factor of two), this might look like:

While (size != 0) {
     s = min(size, 5552)
     end = data + s – 7
     while (data < end) {
           compute vector sum
           data += 8
     }
     end += 7
     if (0 == (s & 7)) {
           size -= s;
           reduce from vector to scalar sum
           compute modulo
           continue while loop
     }
     // process final 1…7 bytes
     Reduce from vector to scalar sum
     Do final adds in scalar loop
     Compute modulo
}

Performance

The following graph compares the cycle counts as a function of input buffer size for an optimized scalar implementation and for two parallel versions, one based on Streaming SIMD Extensions (SSE) and one based on Intel® Advanced Vector Extensions 2 (Intel® AVX2), as described in this paper.

One can clearly see that the vector versions have a significantly better performance than an optimized scalar one. This is true for all but the smallest buffers.

An Intel® Advanced Vector Extensions 512 version was not tested, but it should perform significantly faster than the Intel AVX2 version.

Versions of this code are in the process of being integrated and released as part of the Intel® Intelligent Storage Acceleration Library (https://software.intel.com/en-us/storage/ISA-L).

Conclusion

This paper illustrated a method for improved Adler32 checksum performance. By leveraging architectural features such as SIMD in the processors and combining innovative software techniques, large performance gains are possible.

Author

Jim Guilford is an architect in the Intel Data Center Group, specializing in software and hardware features relating to cryptography and compression.

Mathematical Concepts and Principles of Naive Bayes


Simplicity is the ultimate sophistication.
—Leonardo Da Vinci


With time, machine learning algorithms are becoming increasingly complex. In most cases, this increases accuracy at the expense of longer training times. Fast-training algorithms that deliver decent accuracy are also available. These algorithms are generally based on simple mathematical concepts and principles. Today, we’ll have a look at one such machine-learning classification algorithm, naive Bayes. It is an extremely simple, probabilistic classification algorithm which, astonishingly, achieves decent accuracy in many scenarios.

Naive Bayes Algorithm


In machine learning, naive Bayes classifiers are simple, probabilistic classifiers that use Bayes’ Theorem. Naive Bayes makes strong (naive) independence assumptions between features. In simple terms, a naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a ball may be considered a soccer ball if it is hard, round, and about seven inches in diameter. Even if these features depend on each other or upon the existence of the other features, naive Bayes assumes that all of these properties independently contribute to the probability that this ball is a soccer ball. This is why it is known as naive.

Naive Bayes models are easy to build. They are also very useful for very large datasets. Although naive Bayes models are simple, they can outperform even highly sophisticated classification models. Because they also require a relatively short training time, they make a good alternative for use in classification problems.

Mathematics Behind Naive Bayes

Bayes’ Theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x), and P(x|c). Consider the following equation:

     P(c|x) = P(x|c) * P(c) / P(x)

Here,

  • P(c|x): the posterior probability of class (c, target) given predictor (x, attributes). This represents the probability of c being true, provided x is true.
  • P(c): the prior probability of class. This is the observed probability of the class out of all the observations.
  • P(x|c): the likelihood, which is the probability of predictor given class. This represents the probability of x being true, provided c is true.
  • P(x): the prior probability of predictor. This is the observed probability of the predictor out of all the observations.

Let’s better understand this with the help of a simple example. Consider a well-shuffled deck of playing cards. A card is picked from that deck at random. The objective is to find the probability of a King card, given that the card picked is red in color.

Here,

     P(King | Red Card) = ?

We’ll use,

     P(King | Red Card) = P(Red Card | King) x P(King) / P(Red Card)

So,

     P (Red Card | King) = Probability of getting a Red card given that the card chosen is King = 2 Red Kings / 4 Total Kings = 1/2

     P (King) = Probability that the chosen card is a King = 4 Kings / 52 Total Cards = 1 / 13

     P (Red Card) = Probability that the chosen card is red = 26 Red cards / 52 Total Cards = 1 / 2

Hence, the posterior probability of randomly choosing a King given a Red card is:

     P (King | Red Card) = (1 / 2) x (1 / 13) / (1 / 2) = 1 / 13 or 0.077
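The same arithmetic can be verified with exact fractions in Python (a quick check, not part of any library workflow):

from fractions import Fraction

p_red_given_king = Fraction(2, 4)    # 2 red kings out of 4 kings
p_king = Fraction(4, 52)             # 4 kings in the deck
p_red = Fraction(26, 52)             # 26 red cards in the deck

p_king_given_red = p_red_given_king * p_king / p_red
print(p_king_given_red, float(p_king_given_red))   # 1/13, about 0.077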

Understanding Naive Bayes with an Example

Let’s understand naive Bayes with one more example—to predict the weather based on three predictors: humidity, temperature and wind speed. The training data is the following:

Humidity  | Temperature | Wind Speed | Weather
Humid     | Hot         | Fast       | Sunny
Humid     | Hot         | Fast       | Sunny
Humid     | Hot         | Slow       | Sunny
Not Humid | Cold        | Fast       | Sunny
Not Humid | Hot         | Slow       | Rainy
Not Humid | Cold        | Fast       | Rainy
Humid     | Hot         | Slow       | Rainy
Humid     | Cold        | Slow       | Rainy

We’ll use naive Bayes to predict the weather for the following test observation:

Humidity  | Temperature | Wind Speed | Weather
Humid     | Cold        | Fast       | ?

We have to determine which posterior is greater, sunny or rainy. For the classification Sunny, the posterior is given by:

     Posterior(Sunny) = (P(Sunny) x P(Humid | Sunny) x P(Cold | Sunny) x P(Fast | Sunny)) / evidence

Similarly, for the classification Rainy, the posterior is given by:

     Posterior(Rainy) = (P(Rainy) x P(Humid | Rainy) x P(Cold | Rainy) x P(Fast | Rainy)) / evidence

Where,

     evidence = [ P(Sunny) x P(Humid | Sunny) x P(Cold | Sunny) x P(Fast | Sunny) ] + [ P(Rainy) x P(Humid | Rainy) x P(Cold | Rainy) x P(Fast | Rainy) ]

Here,

     P(Sunny) = 0.5
     P(Rainy) = 0.5
     P(Humid | Sunny) = 0.75
     P(Cold | Sunny) = 0.25
     P(Fast | Sunny) = 0.75
     P(Humid | Rainy) = 0.5
     P(Cold | Rainy) = 0.5
     P(Fast | Rainy) = 0.25

Therefore, evidence = (0.5 x 0.75 x 0.25 x 0.75) + (0.5 x 0.5 x 0.5 x 0.25) = 0.0703 + 0.0313 = 0.1016.

     Posterior (Sunny) = 0.0703 / 0.1016 = 0.692
     Posterior (Rainy) = 0.0313 / 0.1016 = 0.308

Since the posterior is greater for Sunny, we predict that the weather for the test observation is Sunny.
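The same arithmetic can be scripted. Below is a minimal Python sketch of our own (the dataset literal and the posterior helper are ours, not from a library) that derives the priors and conditional probabilities from the training table above and reproduces the posteriors:

from collections import Counter

# Training table from above: (humidity, temperature, wind speed, weather)
data = [
    ("Humid", "Hot", "Fast", "Sunny"), ("Humid", "Hot", "Fast", "Sunny"),
    ("Humid", "Hot", "Slow", "Sunny"), ("Not Humid", "Cold", "Fast", "Sunny"),
    ("Not Humid", "Hot", "Slow", "Rainy"), ("Not Humid", "Cold", "Fast", "Rainy"),
    ("Humid", "Hot", "Slow", "Rainy"), ("Humid", "Cold", "Slow", "Rainy"),
]

def posterior(test, data):
    classes = Counter(row[-1] for row in data)
    total = len(data)
    scores = {}
    for c, count in classes.items():
        rows = [row for row in data if row[-1] == c]
        score = count / total                      # prior P(c)
        for i, value in enumerate(test):           # naive likelihoods P(x_i | c)
            score *= sum(1 for r in rows if r[i] == value) / len(rows)
        scores[c] = score
    evidence = sum(scores.values())                # P(x)
    return {c: s / evidence for c, s in scores.items()}

print(posterior(("Humid", "Cold", "Fast"), data))
# {'Sunny': ~0.692, 'Rainy': ~0.308} -> predict Sunny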

Applications of Naive Bayes

Naive Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood. Despite their naive design and apparently oversimplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations.

  • Recommendation System: Naive Bayes classifiers are used in various inferencing systems for making certain recommendations to users out of a list of possible options.
  • Real-Time Prediction: Naive Bayes is a fast algorithm, which makes it an ideal fit for making predictions in real time.
  • Multiclass Prediction: This algorithm is also well-known for its multiclass prediction feature. Here, we can predict the probability of multiple classes of the target variable.
  • Sentiment Analysis: Naive Bayes is used in sentiment analysis on social networking datasets like Twitter* and Facebook* to identify positive and negative customer sentiments.
  • Text Classification: Naive Bayes classifiers are frequently used in text classification and provide a high success rate, as compared to other algorithms.
  • Spam Filtering: Naive Bayes is widely used in spam filtering for identifying spam email.

Why is Naive Bayes so Efficient?

An interesting point about naive Bayes is that even when the independence assumption is violated and there are clear, known relationships between attributes, it works decently anyway. There are two reasons that make naive Bayes a very efficient algorithm for classification problems.

  1. Performance: The naive Bayes algorithm performs well even when the dataset contains correlated variables, despite its basic assumption of independence among features. The reason for this is that in a given dataset, two attributes may depend on each other, but the dependence may distribute evenly in each of the classes. In this case, the conditional independence assumption of naive Bayes is violated, but it is still the optimal classifier. Further, what eventually affects the classification is the combination of dependencies among all attributes. If we just look at two attributes, there may exist strong dependence between them that affects the classification. When the dependencies among all attributes work together, however, they may cancel each other out and no longer affect the classification. Therefore, we argue that it is the distribution of dependencies among all attributes over classes that affects the classification of naive Bayes, not merely the dependencies themselves.
  2. Speed: The main cause of naive Bayes's fast training is that it converges toward its asymptotic accuracy at a different rate than other methods, like logistic regression, support vector machines, and so on. Naive Bayes parameter estimates converge toward their asymptotic values in order of log(n) examples, where n is the number of dimensions. In contrast, logistic regression parameter estimates converge more slowly, requiring on the order of n examples. It has also been observed on several datasets that logistic regression outperforms naive Bayes when training examples are abundant, but naive Bayes outperforms logistic regression when training data is scarce.

Practical Applications of Naive Bayes: Email Classifier—Spam or Ham?

Let’s see a practical application of naive Bayes for classifying email as spam or ham. We will use sklearn.naive_bayes to train a spam classifier in Python*.

import os
import io
import numpy
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

The following example will use the MultinomialNB classifier.

Creating the readFiles function:

def readFiles(path):
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            path = os.path.join(root, filename)
            inBody = False
            lines = []
            f = io.open(path, 'r', encoding='latin1')
            for line in f:
                if inBody:
                    lines.append(line)
                elif line == '\n':
                    inBody = True
            f.close()
            message = '\n'.join(lines)
            yield path, message

Creating a function to help us create a dataFrame:

def dataFrameFromDirectory(path, classification):
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)
    return DataFrame(rows, index=index)
data = DataFrame({'message': [], 'class': []})
data = data.append(dataFrameFromDirectory('/…/SPAMORHAM/emails/spam/', 'spam'))
data = data.append(dataFrameFromDirectory('/…/SPAMORHAM/emails/ham/', 'ham'))

Let's have a look at that dataFrame:

data.head()

 

      class message

 

      /…/SPAMORHAM/emails/spam/00001.7848dde101aa985090474a91ec93fcf0 spam <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Tr...
      /…/SPAMORHAM/emails/spam/00002.d94f1b97e48ed3b553b3508d116e6a09 spam 1) Fight The Risk of Cancer!\n\nhttp://www.adc...
      /…/SPAMORHAM/emails/spam/00003.2ee33bc6eacdb11f38d052c44819ba6c spam 1) Fight The Risk of Cancer!\n\nhttp://www.adc...
      /…/SPAMORHAM/emails/spam/00004.eac8de8d759b7e74154f142194282724 spam ##############################################...
      /…/SPAMORHAM/emails/spam/00005.57696a39d7d84318ce497886896bf90d spam I thought you might like these:\n\n1) Slim Dow...

Now we will use a CountVectorizer to split up each message into its list of words, and feed the resulting counts into a MultinomialNB classifier by calling its fit() method:

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'].values)
counts

      <3000x62964 sparse matrix of type '<type 'numpy.int64'>'

        with 429785 stored elements in Compressed Sparse Row format>

Now we train the classifier using MultinomialNB():

classifierModel = MultinomialNB()

# The 'class' column is the target
targets = data['class'].values

# Use the word counts to fit the model
classifierModel.fit(counts, targets)

      MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

The classifierModel is ready. Now, let’s prepare sample email messages to see how the model works.
Email number 1 is Free Viagra now!!!, Email number 2 is A quick brown fox is not ready, and so on:

examples = ['Free Viagra now!!!',"A quick brown fox is not ready","Could you bring me the black coffee as well?","Hi Bob, how about a game of golf tomorrow, are you FREE?","Dude , what are you saying","I am FREE now, you can come","FREE FREE FREE Sex, I am FREE","CENTRAL BANK OF NIGERIA has 100 Million for you","I am not available today, meet Sunday?"]
example_counts = vectorizer.transform(examples)

Now we are using the classifierModel to predict:

predictions = classifierModel.predict(example_counts)

Let’s check the prediction for each email:

predictions

   array(['spam', 'ham', 'ham', 'ham', 'ham', 'ham', 'spam', 'spam', 'ham'],
      dtype='|S4')

Therefore, the first email is spam, the second is ham, and so on.

End Notes

We hope you have gained a clear understanding of the mathematical concepts and principles of naive Bayes from this guide. It is an extremely simple algorithm, with assumptions that are at times oversimplified and might not hold true in many real-world scenarios. In this article we explained why naive Bayes nevertheless often produces decent results. We feel naive Bayes is a very good algorithm, and its performance, despite its simplicity, is astonishing.

Exploring the HPS and FPGA onboard the Terasic DE10-Nano


Introduction:

The Terasic DE-10 Nano is a development kit that contains a Cyclone* V device.  The Cyclone V contains a Hard Processor System (HPS) and field-programmable gate array (FPGA) with a wealth of peripherals onboard for creating some interesting applications.  One of the most basic things to accomplish with this system is to control the LEDs that are physically connected to the FPGA.  This tutorial will discuss four different methods for controlling the LEDs using the command line, memory mapped IO, schematic, and Verilog HDL.  Whether you are an application developer, firmware engineer, hardware engineer, or enthusiast, there is a method suited for you.

 

Prerequisites:

There is a wealth of datasheets, user guides, tools, and other information available for the DE-10 Nano.  You are encouraged to review this documentation to get a deeper understanding of the system.  For this tutorial, please download and install the following first:

01:  Intel® Quartus® Prime Software Suite Lite Edition for Cyclone V - http://dl.altera.com/?edition=lite

02:  Install EDS with DS-5 - http://dl.altera.com/soceds/

03:  Install Putty (Windows* Users) - http://www.putty.org/

04:  Install WinScp (Windows Users) - https://winscp.net/eng/download.php

05:  Install the RNDIS Ethernet over USB driver (Windows Users) – See Terasic DE10 Nano Quick Start Guide Section 3

06:  Download  DE-10 Nano Resources - http://www.terasic.com.tw/cgi-bin/page/archive.pl?Language=English&CategoryNo=205&No=1046&PartNo=4

 

Method #1 - Command Line:

Out of the box, the DE-10 Nano HPS is pre-configured to boot a Linux* image and the FPGA is pre-configured with the Golden Hardware Reference Design (GHRD).  This means you have a complete system and can get started exploring right away by simply applying power to the board.  The most basic method to control an LED using the DE-10 Nano HPS is with the file system.  This method lends itself well to the scripters out there that want to do something basic and work at the filesystem level.  This can be easily illustrated using the serial terminal.  To begin, perform the following:

01:  Connect a Mini-B USB cable from the host to the DE10 Nano USB UART (Right Side of board)

02:  Open a serial terminal program

03:  Connect using a 115200 baud rate

04:  Login as root, no password is needed

05:  Turn on the LED

               echo 1 >  /sys/class/leds/fpga_led0/brightness

06:  Turn off the LED

               echo 0 >  /sys/class/leds/fpga_led0/brightness
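The same sysfs interface can also be driven from a script. Below is a minimal Python sketch (run as root on the stock Linux image; the LED path is the one used in the steps above) that blinks fpga_led0:

# Minimal sketch: blink fpga_led0 via the same sysfs interface used above.
# Requires the stock DE10-Nano Linux image exposing /sys/class/leds/fpga_led0.
import time

LED = "/sys/class/leds/fpga_led0/brightness"

def set_led(on):
    with open(LED, "w") as f:
        f.write("1" if on else "0")

for _ in range(10):          # toggle the LED ten times, 500 ms apart
    set_led(True)
    time.sleep(0.5)
    set_led(False)
    time.sleep(0.5)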

 

Method #2 – C Based Application:

Another method to control the LEDs using the HPS is to go lower level and develop a Linux application that accesses the memory-mapped regions exposed by the FPGA that control the LEDs.  The HPS can access this region using the Lightweight HPS2FPGA AXI Bridge (LWFPGASLAVES) that connects to the Parallel IO (LED_PIO) area.  The C based application will map this region into the user application space, toggle all 8 LEDs every 500 ms a few times, unmap the region, and exit.  We will develop the application using the Eclipse DS-5 IDE.  To begin, perform the following:

01:  Open Eclipse DS-5

02:  Create a New Project

               02a:  File->New->C Project->Hello World Project

               02b:  Enter a project Name

               02c:  Select GCC 4.x (arm-linux-gnueabihf) Toolchain

               02d:  Click Finish

03:  Delete the generated Hello World code

04:  Add Preprocessor Includes & Defines

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>

#define ALT_LWFPGASLVS_OFST 0xFF200000
#define LED_PIO_BASE             0x3000
#define LED_PIO_SPAN            0x10

05: Add main() function

 int main(void)
 {
           unsigned long *led_pio;
           int fd;
           int i;

           fd = open("/dev/mem", (O_RDWR | O_SYNC));


           //Map LED_PIO Physical Address to Virtual Address Space
           led_pio = mmap( NULL, LED_PIO_SPAN, ( PROT_READ | PROT_WRITE ), MAP_SHARED, fd, (ALT_LWFPGASLVS_OFST + LED_PIO_BASE) );

           //Toggle all LEDs every 500ms a few times
           for (i=0; i < 10; i++)
           {
                *led_pio ^= 0xFF; //Bit0=LED0 … Bit7=LED7
                usleep(1000*500);
           }


           //Unmap
           munmap(led_pio, LED_PIO_SPAN);

           close(fd);
           return(0);
 }

   

06:  Build the Project

               Project->Build Project

07:  Test out the application

               07a:  Connect a micro USB cable from the host to the DE10 Nano USB OTG Port

               07b:  Use scp to transfer the application to the DE10 Nano at root@192.168.7.1

               07c:  Run the application from the serial terminal ./<applicationName>

 

Methods #3 & #4 - Schematic and Verilog HDL:

So far we have been controlling the LEDs connected to the FPGA from the HPS using the command line and a C based application.  Now let’s discuss controlling the LEDs directly using just the FPGA logic.  In this design, we will turn on/off 2 LEDs when a pushbutton is pressed/released respectively.  It should be noted that the pushbutton is pulled up and the LEDs are active high so an inverter is used to get the desired behavior when the button is pressed/released.  We will use a schematic based approach to create the logic to control the first LED and a Verilog HDL approach to create the similar logic to control the second LED.  We will create the design using the Intel® Quartus® Prime Software Suite Lite Edition software.  To begin the project, perform the following:

01:  Open Quartus Prime Lite

02:  Create a New Project

               02a:  File->New->New Quartus Prime Project->Next

               02b:  Enter Project Name->Next->Empty Project

               02c:  Click Next

               02d:  Name Filter->5CSEBA6U23I7->Finish

03:  Create a New Schematic File

               03a:  File->New->Block Diagram/Schematic File

04:  Add LED Output Pins

               04a:  Right Click->Insert Symbol->primitives->pin->output

               04b:  Right Click->Insert Symbol->primitives->pin->output

               04c:  Right Click on output pin->Properties->Name=led_0

               04d:  Right Click on output pin->Properties->Name=led_1

05:  Add Push Button Input Pin

               05a:  Right Click->Insert Symbol->primitives->pin->input

               05b:  Right Click on input pin->Properties->Name=pushButton

06:  Add Inverter

                06a:  Right Click->Insert Symbol->primitives->logic->not

07:  Connect Everything Up

               07a:  Connect pushButton to Inverter Input

               07b:  Connect Inverter Output to led_0

08:  Create a New Verilog HDL File

               08a:  File->New->Verilog HDL File

               08b:  Enter Verilog Code     

module inverter (

           input      in,
           output     out
     );

     assign out = !in;

endmodule 

    08c:  Save File

               08d:  Update Symbols

                              File->Create/Update->Create Symbols From Current File

               08e:  Add Verilog Module to Top Level Schematic

                              Right Click->Insert Symbol->inverter->Click Ok

               

               08f:  Connect pushButton Input to Inverter Input

               08g:  Connect Inverter Output to led_1 Output Pin

               

              

09:  Assign Inputs and Outputs to Physical FPGA Pins

               09a:  Processing->Start->Start Analysis & Elaboration

               09b:  Assignments->Pin Planner

                              led_0->Location=PIN_W15, I/O Standard=3.3-V LVTTL

                              led_1->Location=PIN_AA24, I/O Standard=3.3-V LVTTL

                              pushButton->Location=PIN_AH17

10:  Compile Project

               10a:  Start->Start Compilation

11:  Program the FPGA

               11a:  Connect mini USB cable to JTAG USB Blaster port (near HDMI connector and Blue LED)

               11b:  Click Tools->Programmer

               11c:  Click Hardware Setup->Currently Selected Hardware->DE-SoC

               11d:  Click AutoDetect->5CSEBA6

               11e:  Click on 5CSEBA6U23 Chip Picture

               11f:  Click Change File-><project directory>\output_files\yourFile.sof

               11g:  Click Program/Configure checkbox

               11h:  Click Start Button

12:  Test out the design

               12a:  Push and hold the right pushbutton and led_0 and led_1 will turn on

12b:  Release the pushbutton and the LEDs will turn off

 

Summary:

The DE-10 Nano has a lot to offer the engineer from its capabilities, variety of tools, programming methods, and documentation.  This tutorial showed four different methods for controlling the LEDs that utilized the HPS and FPGA using the available tools.  You can take all of these concepts to the next level when you begin your next project with the DE-10 Nano.

 

About the Author:

Mike Rylee is a Software Engineer at Intel Corporation with a background in developing embedded systems and apps for Android*, Windows*, iOS*, and Mac*.  He currently works on Internet of Things projects.          

              

Notices:

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.

Intel, the Intel logo, and Intel RealSense are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others

**This sample source code is released under the Intel Sample Source Code License Agreement

© 2017 Intel Corporation.            

Intel Parallel Studio XE Evaluation Guide


System Requirements and Prerequisites

To ensure successful installation, please review the release notes and verify that your system has the capability, capacity and prerequisites to install the product.

Find the desired version

  1. Go to: Intel® Parallel Studio XE Try and Buy
  2. Select the OS you need and click Download FREE Trial>


     
  3. What is in each package?
    • Windows*: Intel Parallel Studio XE Cluster Edition for Windows* (C++ and Fortran)
    • Linux*: Intel Parallel Studio XE Cluster Edition for Linux* (C++ and Fortran)
    • OS X* C++: Intel Parallel Studio XE Composer Edition for C++ OS X*
    • OS X* Fortran: Intel Parallel Studio XE Composer Edition for Fortran OS X*
  4. Note: although you are offered the Cluster edition you will be able to download and install a smaller, customized package

Complete the evaluation request form

  1. You will be asked to supply your email address and some additional information.
  2. After submitting the form you will receive a Registration Email from Intel.
  3. Important: If you don’t find the email in your Inbox look for it in other folders, such as Promotions, Spam, etc.

Register for Priority Customer Support [Optional]

  1. Registering for Evaluation does not create a full Intel account, as no login ID or password is required to evaluate the product.
  2. In the Registration Email you will find a link to create a full account.
  3. Why should you create a full account?
    • A full account will give you 30 days of Priority Customer Support
    • You will be able to log into Intel Registration Center and manage your licenses
    • This is a single sign-in account. It will enable you to access the Intel® Developer Zone, Online Support Center and other areas within Intel
  4. Note: If you already have an Intel account you don’t need to create another one.

Download

  1. Click the Download > link in the email to download the product.
  2. You can choose from two download options:
    • The Online Download option will launch the online installer. You will be able to install the product or create a customized package for later installation
    • The Offline Download option will download the complete package
      Note: This package is typically big. The main advantage of this option is to download a package for an installation on a different OS

Downloading an older version

If you wish to install and evaluate an older version of the product see: How do I get an older version of an Intel Software Development Product for instructions.

Using the online installer to download a customized package

  1. In the installer choose the option to download for later installation.
  2. Proceed to select the components to download or use the default configuration.
    Note: The full evaluation package is quite big as it encompasses the compilers, libraries, analyzers and cluster tools.
  3. You can use the package to install the product on your system or on another computer, as desired.

Install

  1. In the Online Installer choose the option Install to this computer.
  2. OR use a previously downloaded package to install the product.
  3. Proceed to select the components to install or choose the recommended settings.
  4. Important: You do not require a serial number in order to install and evaluate the product.
    • If you have not installed Intel Parallel Studio XE before, in the license activation screen, choose the “Evaluation” activation option.
    • If the Installer finds a compatible license file on your system it will recommend a “License” activation.
  5. Given the suite size, the installation is expected to be quite lengthy. Please refrain from aborting mid-installation. If you cancel the installation, please let the rollback run its course.

Installing the product without Internet connection

If you install your product with no Internet connection you will need to use a “License” activation. If you don’t have a compatible license file already on your system you will need to create one. See: How do I get a license file for offline installation.

Installing the product on multiple systems

The product can be installed on multiple systems as defined by our EULA. If you work in a VM environment and need additional activation see: How do I release activation.

Start your evaluation

Once the installation is complete the “Getting Started” guide will open. If you are new to our products this is a good way to explore and get familiar with the compilers, libraries and tools. You can also find the guides on our site. See: Getting Started with Intel Parallel Studio XE.

Need support?

If you run into issues during installation or during the evaluation of the product, let us know. We want to hear from you and help you get the most out of your evaluation.

  1. Check out our FAQs.
  2. For peer questions and discussions, see our Developer Forums.
  3. To report issues and seek help, please file a ticket at the Online Service Center.
    Note: In order to get Priority Support make sure to register in the Intel Registration Center. Use the link in your email.

What’s next?

We at Intel are continually working to improve your experience with our developer program. After your evaluation you will receive a Feedback Survey. We would greatly appreciate a few minutes of your time to provide us feedback on what we are doing well and how we can improve.

We hope you enjoyed your evaluation and would like to purchase one of Intel Software Development Products. Please see our Purchasing FAQ for additional information.

Intel® MPI Library 2018 Beta Release Notes for Linux* OS


Overview

Intel® MPI Library is a multi-fabric message passing library based on ANL* MPICH3* and OSU* MVAPICH2*.

Intel® MPI Library implements the Message Passing Interface, version 3.1 (MPI-3) specification. The library is thread-safe and provides MPI standard compliant multi-threading support.

To receive technical support and updates, you need to register your product copy. See Technical Support below.

Product Contents

  • The Intel® MPI Library Runtime Environment (RTO) contains the tools you need to run programs, including a scalable process management system (Hydra), supporting utilities, and shared (.so) libraries.
  • The Intel® MPI Library Development Kit (SDK) includes all of the Runtime Environment components and compilation tools: compiler wrapper scripts (mpicc, mpiicc, etc.), include files and modules, static (.a) libraries, debug libraries, and test codes.

What's New

Intel® MPI Library 2018 Beta Update 1

  • Deprecated support for the IPM statistics format.

Intel® MPI Library 2018 Beta

  • Improved startup times for Hydra when using shm:ofi or shm:tmi.
  • Hard finalization is now the default.
  • The default fabric list is changed when Intel® Omni-Path Architecture is detected.
  • Removed support for the Intel® Xeon Phi™ coprocessor (code named Knights Corner).
  • Documentation is now online.

Intel® MPI Library 2017 Update 2

  • Added environment variables I_MPI_HARD_FINALIZE and I_MPI_MEMORY_SWAP_LOCK.

Intel® MPI Library 2017 Update 1

  • PMI-2 support for SLURM*, improved SLURM support by default.
  • Improved mini help and diagnostic messages, man1 pages for mpiexec.hydra, hydra_persist, and hydra_nameserver.
  • Deprecations:
    • Intel® Xeon Phi™ coprocessor (code named Knights Corner) support.
    • Cross-OS launches support.
    • DAPL, TMI, and OFA fabrics support.

Intel® MPI Library 2017

  • Support for the MPI-3.1 standard.
  • New topology-aware collective communication algorithms (I_MPI_ADJUST family).
  • Effective MCDRAM (NUMA memory) support. See the Developer Reference, section Tuning Reference > Memory Placement Policy Control for more information.
  • Controls for asynchronous progress thread pinning (I_MPI_ASYNC_PROGRESS).
  • Direct receive functionality for the OFI* fabric (I_MPI_OFI_DRECV).
  • PMI2 protocol support (I_MPI_PMI2).
  • New process startup method (I_MPI_HYDRA_PREFORK).
  • Startup improvements for the SLURM* job manager (I_MPI_SLURM_EXT).
  • New algorithm for MPI-IO collective read operation on the Lustre* file system (I_MPI_LUSTRE_STRIPE_AWARE).
  • Debian Almquist (dash) shell support in compiler wrapper scripts and mpitune.
  • Performance tuning for processors based on Intel® microarchitecture codenamed Broadwell and for Intel® Omni-Path Architecture (Intel® OPA).
  • Performance tuning for Intel® Xeon Phi™ Processor and Coprocessor (code named Knights Landing) and Intel® OPA.
  • OFI latency and message rate improvements.
  • OFI is now the default fabric for Intel® OPA and Intel® True Scale Fabric.
  • MPD process manager is removed.
  • Dedicated pvfs2 ADIO driver is disabled.
  • SSHM support is removed.
  • Support for the Intel® microarchitectures older than the generation codenamed Sandy Bridge is deprecated.
  • Documentation improvements.

Key Features

  • MPI-1, MPI-2.2 and MPI-3.1 specification conformance.
  • Support for Intel® Xeon Phi™ processors (formerly code named Knights Landing).
  • MPICH ABI compatibility.
  • Support for any combination of the following network fabrics:
    • Network fabrics supporting Intel® Omni-Path Architecture (Intel® OPA) devices, through either Tag Matching Interface (TMI) or OpenFabrics Interface* (OFI*).
    • Network fabrics with tag matching capabilities through Tag Matching Interface (TMI), such as Intel® True Scale Fabric, Infiniband*, Myrinet* and other interconnects.
    • Native InfiniBand* interface through OFED* verbs provided by Open Fabrics Alliance* (OFA*).
    • Open Fabrics Interface* (OFI*).
    • RDMA-capable network fabrics through DAPL*, such as InfiniBand* and Myrinet*.
    • Sockets, for example, TCP/IP over Ethernet*, Gigabit Ethernet*, and other interconnects.
  • Support for the following MPI communication modes related to Intel® Xeon Phi™ coprocessor:
    • Communication inside the Intel Xeon Phi coprocessor.
    • Communication between the Intel Xeon Phi coprocessor and the host CPU inside one node.
    • Communication between the Intel Xeon Phi coprocessors inside one node.
    • Communication between the Intel Xeon Phi coprocessors and host CPU between several nodes.
  • (SDK only) Support for Intel® 64 architecture and Intel® MIC Architecture clusters using:
    • Intel® C++/Fortran Compiler 14.0 and newer.
    • GNU* C, C++ and Fortran 95 compilers.
  • (SDK only) C, C++, Fortran 77, Fortran 90, and Fortran 2008 language bindings.
  • (SDK only) Dynamic or static linking.

System Requirements

Hardware Requirements

  • Systems based on the Intel® 64 architecture, in particular:
    • Intel® Core™ processor family
    • Intel® Xeon® E5 v4 processor family recommended
    • Intel® Xeon® E7 v3 processor family recommended
    • 2nd Generation Intel® Xeon Phi™ Processor (formerly code named Knights Landing)
  • 1 GB of RAM per core (2 GB recommended)
  • 1 GB of free hard disk space

Software Requirements

  • Operating systems:
    • Red Hat* Enterprise Linux* 6, 7
    • Fedora* 23, 24
    • CentOS* 6, 7
    • SUSE* Linux Enterprise Server* 11, 12
    • Ubuntu* LTS 14.04, 16.04
    • Debian* 7, 8
  • (SDK only) Compilers:
    • GNU*: C, C++, Fortran 77 3.3 or newer, Fortran 95 4.4.0 or newer
    • Intel® C++/Fortran Compiler 15.0 or newer
  • Debuggers:
    • Rogue Wave* Software TotalView* 6.8 or newer
    • Allinea* DDT* 1.9.2 or newer
    • GNU* Debuggers 7.4 or newer
  • Batch systems:
    • Platform* LSF* 6.1 or newer
    • Altair* PBS Pro* 7.1 or newer
    • Torque* 1.2.0 or newer
    • Parallelnavi* NQS* V2.0L10 or newer
    • NetBatch* v6.x or newer
    • SLURM* 1.2.21 or newer
    • Univa* Grid Engine* 6.1 or newer
    • IBM* LoadLeveler* 4.1.1.5 or newer
    • Platform* Lava* 1.0
  • Recommended InfiniBand* software:
    • OpenFabrics* Enterprise Distribution (OFED*) 1.5.4.1 or newer
    • Intel® True Scale Fabric Host Channel Adapter Host Drivers & Software (OFED) v7.2.0 or newer
    • Mellanox* OFED* 1.5.3 or newer
  • Virtual environments:
    • Docker* 1.13.0
  • Additional software:
    • The memory placement functionality for NUMA nodes requires the libnuma.so library and the numactl utility installed. The numactl installation should include numactl, numactl-devel, and numactl-libs.

Known Issues and Limitations

  • The I_MPI_JOB_FAST_STARTUP variable takes effect only when shm is selected as the intra-node fabric.
  • ILP64 is not supported by MPI modules for Fortran* 2008.
  • In case of program termination (like signal), remove trash in the /dev/shm/ directory manually with:
    rm -r /dev/shm/shm-col-space-*
  • In case of large number of simultaneously used communicators (more than 10,000) per node, it is recommended to increase the maximum numbers of memory mappings with one of the following methods:
    • echo 1048576 > /proc/sys/vm/max_map_count
    • sysctl -w vm.max_map_count=1048576
    • disable shared memory collectives by setting the variable: I_MPI_COLL_INTRANODE=pt2pt
  • On some Linux* distributions Intel® MPI Library may fail for non-root users due to security limitations. This was observed on Ubuntu* 12.04, and could impact other distributions and versions as well. Two workarounds exist:
    • Enable ptrace for non-root users with:
      echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
    • Revert the Intel® MPI Library to an earlier shared memory mechanism, which is not impacted, by setting: I_MPI_SHM_LMT=shm
  • Ubuntu* does not allow attaching a debugger to a non-child process. In order to use -gdb, this behavior must be disabled by setting the sysctl value in /proc/sys/kernel/yama/ptrace_scope to 0.
  • Cross-OS runs using ssh from a Windows* host fail. Two workarounds exist:
    • Create a symlink on the Linux* host that looks identical to the Windows* path to pmi_proxy.
    • Start hydra_persist on the Linux* host in the background (hydra_persist &) and use -bootstrap service from the Windows* host. This requires that the Hydra service also be installed and started on the Windows* host.
  • The OFA fabric and certain DAPL providers may not work or provide worthwhile performance with the Intel® Omni-Path Fabric. For better performance, try choosing the OFI or TMI fabric.
  • Enabling statistics gathering may result in increased time in MPI_Finalize.
  • In systems where some nodes have only Intel® True Scale Fabric or Intel® Omni-Path Fabric available, while others have both Intel® True Scale and, for example, Mellanox* HCAs, automatic fabric detection will lead to a hang or failure, because the first type of node will select ofi/tmi while the second type will select dapl as the inter-node fabric. To avoid this, explicitly specify a fabric that is available on all the nodes (see the example after this list).
  • In order to run a mixed OS job (Linux* and Windows*), all binaries must link to the same single or multithreaded MPI library.  The single- and multithreaded libraries are incompatible with each other and should not be mixed. Note that the pre-compiled binaries for the Intel® MPI Benchmarks are inconsistent (Linux* version links to multithreaded, Windows* version links to single threaded) and as such, at least one must be rebuilt to match the other.
  • Intel® MPI Library does not support using the OFA fabric over an Intel® Symmetric Communications Interface (Intel® SCI) adapter. If you are using an Intel SCI adapter, such as with Intel® Many Integrated Core Architecture, you will need to select a different fabric.
  • The TMI and OFI fabrics over PSM do not support messages larger than 2^32 - 1 bytes. If you have messages larger than this limit, select a different fabric.
  • If communication between two existing MPI applications is established using the process attachment mechanism, the library does not check whether the same fabric has been selected for each application. This situation may cause unexpected application behavior. Set the I_MPI_FABRICS variable to the same values for each application to avoid this issue.
  • Do not load thread-safe libraries through dlopen(3).
  • Certain DAPL providers may not function properly if your application uses system(3), fork(2), vfork(2), or clone(2) system calls. Do not use these system calls or functions based upon them. For example, system(3) fails with the OFED* DAPL provider on Linux* kernel versions earlier than the official 2.6.16 release. Set the RDMAV_FORK_SAFE environment variable to enable the OFED workaround on compatible kernel versions.
  • MPI_Mprobe, MPI_Improbe, and MPI_Cancel are not supported by the TMI and OFI fabrics.
  • You may get an error message at the end of a checkpoint-restart enabled application, if some of the application processes exit in the middle of taking a checkpoint image. Such an error does not impact the application and can be ignored. To avoid this error, set a larger number than before for the -checkpoint-interval option. The error message may look as follows:
    [proxy:0:0@hostname] HYDT_ckpoint_blcr_checkpoint (./tools/ckpoint/blcr/
    ckpoint_blcr.c:313): cr_poll_checkpoint failed: No such process
    [proxy:0:0@hostname] ckpoint_thread (./tools/ckpoint/ckpoint.c:559):
    blcr checkpoint returned error
    [proxy:0:0@hostname] HYDT_ckpoint_finalize (./tools/ckpoint/ckpoint.c:878)
     : Error in checkpoint thread 0x7
  • Intel® MPI Library requires the presence of the /dev/shm device in the system. To avoid failures related to the inability to create a shared memory segment, make sure the /dev/shm device is set up correctly.
  • Intel® MPI Library uses TCP sockets to pass the stdin stream to the application. If you redirect a large file, the transfer can take a long time and cause the communication to hang on the remote side. To avoid this issue, pass large files to the application as command-line options.
  • DAPL auto provider selection mechanism and improved NUMA support require dapl-2.0.37 or newer.
  • If you set I_MPI_SHM_LMT=direct, the setting has no effect if the Linux* kernel version is lower than 3.2.
  • When using the Linux boot parameter isolcpus with an Intel® Xeon Phi™ processor using default MPI settings, an application launch may fail. If possible, change or remove the isolcpus Linux boot parameter. If it is not possible, you can try setting I_MPI_PIN to off.
  • In some cases, collective calls over the OFA fabric may provide incorrect results. Try setting I_MPI_ADJUST_ALLGATHER to a value between 1 and 4 to resolve the issue.
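For reference, here is a minimal sketch of explicitly pinning the fabric selection, as mentioned in the mixed-fabric item above. The host names, process counts, and the shm:tmi combination are illustrative assumptions; choose fabrics that are actually present on all of your nodes:

export I_MPI_FABRICS=shm:tmi
mpirun -n 64 -ppn 32 -hosts node1,node2 ./my_mpi_app

The same selection can also be made per run with the -genv I_MPI_FABRICS option on the mpirun command line.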

Technical Support

Every purchase of an Intel® Software Development Product includes a year of support services, which provides priority customer support at our Online Support Service Center web site, http://www.intel.com/supporttickets.

In order to get support you need to register your product in the Intel® Registration Center. If your product is not registered, you will not receive priority support.

Intel® VTune™ Amplifier Disk I/O analysis with Intel® Optane Memory


This article discusses Intel® VTune™ Amplifier disk I/O analysis with Intel® Optane™ memory. Benchmark tools such as CrystalDiskMark*, IOmeter*, SYSmark*, or PCMark* are commonly used to evaluate system I/O efficiency, usually reporting a single score. Power users and PC-gaming enthusiasts may be satisfied with those numbers for performance validation purposes. But what about deeper technical information, such as identifying slow I/O activities, visualizing the I/O queue depth on a timeline, examining I/O API call stacks, and correlating I/O with other system metrics for debugging or profiling? Software developers need these clues to understand how efficiently their programs perform I/O. VTune provides such insight with its new Disk I/O analysis type.
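If you prefer collecting from the command line rather than the GUI, a minimal sketch looks like the following. It assumes VTune Amplifier for Systems is installed, its environment has been sourced (amplxe-vars.sh), the disk-io analysis type is available in your version, and ./my_io_app is a placeholder for your own workload:

amplxe-cl -collect disk-io -d 60 -- ./my_io_app
amplxe-cl -report summary -r <result_dir>

The collected result can then be opened in the VTune GUI to explore the I/O queue depth timeline, I/O API call stacks, and other metrics discussed below.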
 

A bit about I/O Performance metrics

First, some basics: I/O queue depth, read/write latency, and I/O bandwidth are the metrics used to track I/O efficiency. I/O queue depth is the number of I/O commands waiting in a queue to be served. The queue depth (size) depends on the application, driver, OS implementation, and the host controller interface specification, such as AHCI or NVMe. Compared to AHCI, which has a single-queue design, NVMe has a multi-queue design that supports parallel operations.

Imagine a program issuing multiple I/O requests that pass through frameworks, software libraries, a VM or container, runtimes, the OS I/O scheduler, and the driver down to the I/O device's host controller. These requests can be temporarily delayed in any of these components because of different queue implementations and other reasons. Observing changes in the system's queue depth helps you understand how busy system I/O is and what the overall I/O access patterns look like. From the OS perspective, a high queue depth indicates the system is working to consume pending I/O requests, while a queue depth of zero means the I/O scheduler is idle. From the storage device perspective, a high-queue-depth design indicates the storage media or controller can serve large batches of I/O requests at higher speed than a lower-queue-depth design. Read/write latency shows how quickly the storage device completes or responds to an I/O request; its inverse corresponds to IOPS (I/O operations per second). I/O bandwidth is bounded by the capability of the host controller interface: for example, SATA 3.0 can reach a theoretical bandwidth of 600 MB/s, and NVMe over PCIe* 3.0 x2 lanes can reach ~1.87 GB/s.

 

Optane+NAND SSD

 

We expect system I/O performance to increase after adopting Intel® Optane™ memory with Intel® Rapid Storage Technology.

Insight from VTune for a workload running on Optane enabled setup

[Figure 1. I/O API time: SSD vs. SSD + Optane]

Figure 1 shows two VTune results for the PCMark* benchmark: one running on a single SATA NAND SSD, and one on a SATA NAND SSD plus an additional 16 GB NVMe Optane module in Intel® Rapid Storage Technology RAID 0 mode. Beyond the basics covered in VTune's online help for Disk I/O analysis, you can also observe the effective time of I/O APIs by applying the "Task Domain" grouping view. As VTune indicates, I/O API CPU time also improves with Optane acceleration. This makes sense, since most of the I/O API calls in this case are synchronous, and I/O media accelerated by Optane responds quickly.

[Figure 2. Latency: SSD vs. SSD + Optane]

Figure 2 shows how VTune measures the latency of a single I/O operation. We compare the third FileRead operation of test #3 (importing pictures into Windows* Photo Gallery) of the benchmark workload in both cases. Optane + SSD delivers nearly a 5x speedup for this read operation: 60 µs versus 300 µs.

On Linux* targets, VTune also provides a page fault metric. A page fault usually triggers disk I/O to handle page swapping. To avoid frequent disk I/O caused by page faults, the typical approach is to keep more pages in memory instead of swapping pages back to disk. Intel® Memory Drive Technology provides a way to expand memory capacity, and Optane offers performance closest to memory speed. Because this is transparent to the application and the OS, it also mitigates the disk I/O penalty and further increases performance. A common misconception is that asynchronous I/O always improves an application's I/O performance. Asynchronous I/O actually adds responsiveness back to the application because it does not force the CPU to wait, which is what happens when a synchronous I/O API is used and the I/O operation has not yet finished.

Beyond the software design suggestions above, the additional performance lever is upgrading your hardware to faster media. Intel® Optane™ is Intel's leading-edge non-volatile memory technology, enabling memory-like performance at storage-like capacity and cost. VTune can help extract even more software performance by providing insightful analysis.

See also

Intel® Optane™ Technology

Intel® Rapid Storage Technology

Check Intel® VTune™ Amplifier in Intel® System Studio

Intel® VTune™ Amplifier online help - Disk Input and Output Analysis

How to use Disk I/O analysis in Intel® VTune™ Amplifier for systems

Memory Performance in a Nutshell


Call for submissions: Intel HPC Developer Conference


Please consider giving a talk, tutorial or presenting a poster at this year's Intel HPC Developer Conference (November 11-12, 2017 - just before SC17 in Denver).

Submissions will be reviewed and responded to in a rolling fashion - so submit soon! (Best to submit by July 20, but okay until August 18.)

Submit online: https://intelhpcdc2017cfa.hubb.me (full information on dates, topics, etc. is on that web site).

The prior Intel HPC Developer Conferences have been very well rated by attendees - and that is due to the high quality of speakers (talks, tutorials, panels, etc.) that we have enjoyed. We are adding poster sessions this year to open up more discussions with attendees.

Submissions are encouraged for technical talks of 30 minutes, tutorials of 90, 120, or 180 minutes, and poster sessions. Topics include Parallel Programming, AI (ML/HPDA), High Productivity Languages, Visualization (especially Software Defined Visualization and In Situ Visualization), Enterprise, and Systems.

We expect to have another great conference this year - and we know that rests on the high quality presenters. We look forward to your submissions.  Feel free to drop me a note if you have any questions - or simply put in your proposal online, and put any questions in with your submission (we can talk!).

 

Use Intel® Optane™ Technology and Intel® 3D NAND SSDs to Build High-Performance Cloud Storage Solutions


Download Ceph configuration file  [1.9KB]

Introduction

As solid-state drives (SSDs) become more affordable, cloud providers are working to provide high-performance, highly reliable SSD-based storage for their customers. As one of the most popular open source scale-out storage solutions, Ceph faces increasing demand from customers who wish to use SSDs with Ceph to build high-performance storage solutions for their clouds.

The disruptive Intel® Optane™ Solid State Drive based on 3D XPoint™ technology fills the performance gap between DRAM and NAND-based SSDs. At the same time, Intel® 3D NAND TLC is reducing the cost gap between SSDs and traditional spindle hard drives, making all-flash storage an affordable option.

This article presents three Ceph all-flash storage system reference designs, and provides Ceph performance test results on the first Intel Optane and P4500 TLC NAND based all-flash cluster. This cluster delivers multi-million IOPS with extremely low latency as well as increased storage density with competitive dollar-per-gigabyte costs. Click on the link above for a Ceph configuration file with Ceph BlueStore tuning and optimization guidelines, including tuning for rocksdb to mitigate the impact of compaction.

What Motivates Red Hat Ceph* Storage All-Flash Array Development

Several motivations are driving the development of Ceph-based all-flash storage systems. Cloud storage providers (CSPs) are struggling to deliver performance at increasingly massive scale. A common scenario is to build an Amazon EBS-like service for an OpenStack*-based public/private cloud, leading many CSPs to adopt Ceph-based all-flash storage systems. Meanwhile, there is strong demand to run enterprise applications in the cloud. For example, customers are adapting OLTP workloads to run on Ceph when they migrate from traditional enterprise storage solutions. In addition to the major goal of leveraging the multi-purpose Ceph all-flash storage cluster to reduce TCO, performance is an important factor for these OLTP workloads. Moreover, with the steadily declining price of SSDs and efficiency-boosting technologies like deduplication and compression, an all-flash array is becoming increasingly acceptable.

Intel® Optane™ and 3D NAND Technology

Intel Optane technology provides an unparalleled combination of high throughput, low latency, high quality of service, and high endurance. It is a unique combination of 3D XPoint™ Memory Media, Intel Memory and Storage Controllers, Intel Interconnect IP, and Intel® software [1]. Together these building blocks deliver a revolutionary leap forward in decreasing latency and accelerating systems for workloads demanding large capacity and fast storage.

Intel 3D NAND technology improves regular two-dimensional storage by stacking storage cells to increase capacity through higher density and lower cost per gigabyte, and offers the reliability, speed, and performance expected of solid-state memory [3]. It offers a cost-effective replacement for traditional hard-disk drives (HDDs) to help customers accelerate user experiences, improve the performance of apps and services across segments, and reduce IT costs.

Intel Ceph Storage Reference Architectures

Based on different usage cases and application characteristics, Intel has proposed three reference architectures (RAs) for Ceph-based all-flash arrays.

Standard configuration

The standard configuration is ideally suited for throughput-optimized workloads that need high-capacity storage with good performance. We recommend using an NVMe*/PCIe* SSD for journal and caching to achieve the best performance while balancing cost. Table 1 describes the RA using 1x Intel® SSD DC P4600 Series as a journal or BlueStore* RocksDB write-ahead log (WAL) device, 12x HDDs of up to 4 TB for data, an Intel® Xeon® processor, and an Intel® Network Interface Card.

Example: 1x 1.6 TB Intel SSD DC P4600 as a journal, Intel® Cache Acceleration Software, 12 HDDs, Intel® Xeon® processor E5-2650 v4 .

Table 1. Standard configuration.

Ceph Storage Node Configuration – Standard

  • CPU: Intel® Xeon® processor E5-2650 v4
  • Memory: 64 GB
  • NIC: Single 10GbE, Intel® 82599 10 Gigabit Ethernet Controller or Intel® Ethernet Controller X550
  • Storage:
    Data: 12x 4 TB HDD
    Journal or WAL: 1x Intel® SSD DC P4600 1.6 TB
    Caching: P4600
  • Caching Software: Intel® Cache Acceleration Software 3.0; option: Intel® Rapid Storage Technology enterprise/MD4.3; open source cache solutions such as bcache/flashcache

TCO-Optimized Configuration

This configuration provides the best possible performance for workloads that need higher performance, especially for throughput, IOPS, and SLAs, with medium storage capacity requirements, leveraging a mix of NVMe and SATA SSDs.

Table 2. TCO-optimized configuration.

Ceph Storage Node – TCO Optimized

  • CPU: Intel® Xeon® processor E5-2690 v4
  • Memory: 128 GB
  • NIC: Dual 10GbE (20 Gb), Intel® 82599 10 Gigabit Ethernet Controller
  • Storage:
    Data: 4x Intel® SSD DC P4500 4, 8, or 16 TB, or Intel® DC SATA SSDs
    Journal or WAL: 1x Intel® SSD DC P4600 Series 1.6 TB

IOPS-Optimized Configuration

The IOPS-optimized configuration provides the best performance (throughput and latency), using Intel Optane Solid State Drives as the journal (FileStore) or WAL device (BlueStore), for a standalone Ceph cluster.

  • All NVMe/PCIe SSD Ceph system
  • Intel Optane Solid State Drive for FileStore Journal or BlueStore WAL
  • NVMe/PCIe SSD data, Intel Xeon processor, Intel® NICs
  • Example: 4x Intel SSD P4500 4, 8, or 16 TB for data, 1x Intel® Optane™ SSD DC P4800X 375 GB as journal (or WAL and database), Intel Xeon processor, Intel® NICs.

Table 3. IOPS-optimized configuration.

Ceph* Storage Node – IOPS Optimized

  • CPU: Intel® Xeon® processor E5-2699 v4
  • Memory: >= 128 GB
  • NIC: 2x 40GbE (80 Gb) or 4x dual 10GbE (80 Gb), Intel® Ethernet Converged Network Adapter X710 family
  • Storage:
    Data: 4x Intel® SSD DC P4500 4, 8, or 16 TB
    Journal or WAL: 1x Intel® Optane™ SSD DC P4800X 375 GB

Notes

  • Journal: Ceph supports multiple storage back-ends. The most popular is FileStore, which uses a file system (for example, XFS*) to store its data. In FileStore, Ceph OSDs use a journal for speed and consistency; using an SSD as the journal device significantly improves Ceph cluster performance.
  • WAL: BlueStore is a new storage back-end designed to replace FileStore in the near future. It overcomes several limitations of XFS and POSIX* that exist in FileStore. BlueStore stores data directly on raw partitions, while the metadata associated with an OSD is kept in RocksDB; RocksDB uses a write-ahead log (WAL) to ensure data consistency.
  • The RA is not a fixed configuration. We will continue to refresh it with the latest Intel® products.

Ceph All-Flash Array performance

This section presents a performance evaluation of the IOPS-optimized configuration based on Ceph BlueStore.

System configuration

The test system described in Table 4 consisted of five Ceph storage servers, each fitted with two Intel® Xeon® processor E5-2699 v4 CPUs and 128 GB of memory, plus 1x Intel® SSD DC P3700 2 TB as a BlueStore WAL device and 4x Intel® SSD DC P3520 2 TB as data drives. One Intel® Ethernet Converged Network Adapter X710 40 Gb NIC, with its two ports bonded together in bonding mode 6, carried the separate cluster and public networks for Ceph, making up the system topology described in Figure 1. The test system also included five client nodes, each fitted with two Intel Xeon processor E5-2699 v4 CPUs, 64 GB of memory, and one Intel Ethernet Converged Network Adapter X710 40 Gb NIC, with two ports bonded in bonding mode 6.

Ceph 12.0.0 (Luminous dev) was used, and each Intel SSD DC P3520 drive ran four OSD daemons. The rbd pool used for the testing was configured with two replicas.
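As a rough sketch of how the test pool and a pre-allocated volume could be prepared, the commands below create a two-replica rbd pool and one 30 GB volume. The placement group count (2048) and the pool and volume names are assumptions for illustration; the actual tests used 100 such volumes:

ceph osd pool create rbd 2048 2048
ceph osd pool set rbd size 2
rbd create vol0 --pool rbd --size 30720
# Pre-allocate the image (for example, with a full fio write pass) to bypass thin provisioning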

Table 4. System configuration.

Ceph Storage Node – IOPS Optimized

  • CPU: Intel® Xeon® processor E5-2699 v4 @ 2.20 GHz
  • Memory: 128 GB
  • NIC: 1x 40 Gb Intel® Ethernet Converged Network Adapter X710, two ports in bonding mode 6
  • Disks: 1x Intel® SSD DC P3700 (2 TB) + 4x Intel® SSD DC P3520 2 TB
  • Software configuration: Ubuntu* 14.04, Ceph 12.0.0

Diagram of cluster topology
Figure 1. Cluster topology.

Testing methodology

To simulate a typical usage scenario, four test patterns were selected using fio with librbd. It consisted of 4K random read and write, and 64K sequential read and write. For each pattern, the throughput (IOPS or bandwidth) was measured as performance metrics with the number of volumes scaling; the volume size was 30 GB. To get stable performance, the volumes were pre-allocated to bypass the performance impact of thin-provisioning. OSD page cache was dropped before each run to eliminate page cache impact. For each test case, fio was configured with a 100 seconds warm up and 300 seconds data collection. Detailed fio testing parameters are included as part of the software configuration.

Performance overview

Table 5 shows promising performance after tuning on this five-node cluster. 64K sequential read and write throughput is 5630 MB/s and 4200 MB/s respectively (the maximums with the Intel Ethernet Converged Network Adapter X710 NIC in bonding mode 6). 4K random read throughput is 1312K IOPS with 1.2 ms average latency, while 4K random write throughput is 331K IOPS with 4.8 ms average latency. The performance measured in the testing was roughly within expectations, except for a regression in the 64K sequential write tests compared with previous Ceph releases, which requires further investigation and optimization.

Table 5. Performance overview.

  Pattern                | Throughput | Average Latency
  64KB Sequential Write  | 4200 MB/s  | 18.9 ms
  64KB Sequential Read   | 5630 MB/s  | 17.7 ms
  4KB Random Write       | 331K IOPS  | 4.8 ms
  4KB Random Read        | 1312K IOPS | 1.2 ms

Scalability tests

Figures 2 through 5 graph the throughput for 4K random and 64K sequential workloads with different numbers of volumes, where each fio instance ran against a volume with a queue depth of 16.

Ceph demonstrated excellent 4K random read performance on the all-flash array reference architecture. As the total number of volumes increased from 1 to 100, the total 4K random read IOPS peaked around 1310K, with an average latency around 1.2 ms. The total 4K random write IOPS peaked around 330K, with an average latency around 4.8 ms.

graphic of results for 4K Random read performance
Figure 2. 4K Random read performance.

graphic of results for 4K random write performance load line
Figure 3. 4K random write performance load line.

For 64K sequential read and write, as the total number of volumes increased from 1 to 100, the sequential read throughput peaked around 5630 MB/s, while sequential write peaked around 4200 MB/s. The sequential write throughput was lower than the previous Ceph release (11.0.2). It requires further investigation and optimization; stay tuned for further updates.

graphic of results for 64K sequential read throughput
Figure 4. 64K sequential read throughput

graphic of results for 64K sequential write throughput
Figure 5. 64K sequential write throughput

Latency Improvement with Intel® Optane™ SSD

Figure 6 shows the latency comparison for 4K random write workloads with 1x Intel® SSD DC P3700 Series 2.0 TB versus 1x Intel Optane SSD DC P4800X Series 375 GB drive as the RocksDB and WAL device. The results showed that with the Intel Optane SSD DC P4800X Series 375 GB SSD as the RocksDB and WAL drive in Ceph BlueStore, latency was significantly reduced: a 226% improvement in 99.99th percentile latency.

graphic of results for 4K random read and 4K random write latency comparison
Figure 6. 4K random read and 4K random write latency comparison

Summary

Ceph is one of the most popular open source scale-out storage solutions, and there is growing interest among cloud providers in building Ceph-based high-performance all-flash array storage solutions. We proposed three reference architecture configurations targeting different usage scenarios. Test results simulating different workload patterns demonstrated that a Ceph all-flash system can deliver very high performance with excellent latency.

Software configuration

Fio configuration used for the testing

Take 4K random read for example.

[global]
    direct=1
    time_based
[fiorbd-randread-4k-qd16-30g-100-300-rbd]
    rw=randread
    bs=4k
    iodepth=16
    ramp_time=100
    runtime=300
    ioengine=rbd
    clientname=${RBDNAME}
    pool=${POOLNAME}
    rbdname=${RBDNAME}
    iodepth_batch_submit=1
    iodepth_batch_complete=1
    norandommap
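A minimal invocation sketch for the job file above follows; the job file name and the pool and volume values are placeholders, and fio substitutes the ${POOLNAME} and ${RBDNAME} environment variables at run time:

POOLNAME=rbd RBDNAME=vol0 fio fiorbd-randread-4k-qd16-30g-100-300-rbd.fio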
  1. http://www.intel.com/content/www/us/en/architecture-and-technology/intel-optane-technology.html
  2. http://ceph.com
  3. http://www.intel.com/content/www/us/en/solid-state-drives/3d-nand-technology-animation.html

This sample source code is released under the Intel Sample Source Code License Agreement.

Before Salmi Games Can Make Bread, It Needs Some Jam


The original article is published by Intel Game Dev on VentureBeat*: Before Salmi Games can make bread, it needs some jam. Get more game dev news and related topics from Intel on VentureBeat (https://venturebeat.com/category/intel-game-dev/).

 Glowing colorful geometric shapes moving against a black screen

Presented by Intel

Code jams have become incredibly popular, often gathering dozens, hundreds, or even thousands of programmers to innovate, collaborate, and compete in relatively quick coding endeavors. In a similar form, game jams have sprung up as a way for game makers to conceive and create a viable game, sometimes in as little as 24 hours.

The game jam concept was well known to Yacine Salmi and his collaborator Stefan Hell. In fact, it’s how they got to know each other. When they decided to come up with their own game ideas, they studied titles they liked and then held two-man brainstorming sessions that they treated like internal game jams.

Image of Stefan Hell and Yacine Salmi standing together in an outdoor area
Above: Stefan Hell (left) and Yacine Salmi of Salmi Games

 

What they generated from those exercises became the seed for the creation of a viable game-development studio called Salmi Games. Last year, the small studio released Ellipsis, an “avoid-’em-up” title that’s reminiscent of Geometry Wars. Initially released on mobile devices, the game launched in a PC version via Steam* this past January.

At their fingertips

Salmi — who was born in America, but is currently living in Munich, Germany — started an umbrella company in 2013 that enabled him to make a living doing freelance coding and that funded his desire to make games on the side.

“I had previously worked in the game industry for 10 years, but this was my attempt at doing the indie life while paying my bills,” Salmi says. “I had previously done another indie company, but it didn’t go as well. I put all of my eggs in one basket and it just fell apart in the end. [This time], I wanted to build something sustainable.”

The first Salmi game jams were intended to explore how touch could be used to control a game. With the touch concept, it made sense to target mobile devices. But touch-controlled games had issues, and Salmi wanted to come up with a way around that.

 Glowing colorful geometric shapes moving against a black screen

“The main reason people don’t build touch-controlled games is because your finger or your hand tend to hide the action,” Salmi explains. “But we really liked the concept we came up with, so we decided to develop it further and work around the limitation presented by the player’s hand.

“We decided to build out levels that were very large and sparse, so you’d have time to see your objectives, see your enemies, and move around. It sort of became a dance with your hand and your fingers.”

The game did well, but Salmi says they hoped to bring it to PC, which was an uncommon path for game software. Not wanting to just port the game over to PC, they devoted time to “do a proper PC version.”

“We realized it would work with a mouse…but we didn’t just want to do a port,” Salmi explains. “We added content, we added a level editor, we redid all of our assets. We really tuned it for the PC…we made sure the game ran on every type of device, and that’s where the Intel optimization tools came in handy.”

The game not only did well on PC, it won Game of the Year, as well as Best Action Game, in the 2016 Intel® Level Up Game Developer Contest.

 Glowing colorful geometric shapes moving against a black screen

The next game could be a smash

Not content to sit back on Ellipsis’ success, Salmi and Hell are exploring new and ambitious product ideas. They’re now pursuing a game with the working title Late For Work, a virtual-reality (VR) game that plays like the old arcade game Rampage, where a King Kong-like gorilla scales skyscrapers and tries to smash them to smithereens, all the while avoiding the humans seeking to take him out.

An early concept includes a multiplayer mode that Salmi hopes will make it a “social VR game.” One person will wear the headset and play the gorilla, while the other two players use gamepads on a PC or jump in with their phones, taking over planes, tanks, and cars. Then they’ll all switch places, so everyone gets a chance to be the gorilla in VR gear.

With such a high-reaching concept, Salmi says that they’re looking at outside funding and considering bringing in more coders to help. Until recently, he and Hell used to work out of their respective Munich homes, but the pair recently moved into an office… and optimistically has gotten one with room for five.

“I’m hoping in the next three or four months we can ramp up… add an artist, add a programmer, and probably we’ll do some more outsourcing on the audio and the animation sides.”

Such are the woes of becoming successful and growing your company.

And all of that from a few two-man game jams.

Intel’s Virtual Reality Director Knows the Future (Hint: It’s Not About Headsets)


The original article is published by Intel Game Dev on VentureBeat*: Intel’s VR director knows the future (Hint: It’s not about headsets). Get more game dev news and related topics from Intel on VentureBeat.

 Shutterstock

Presented by Intel

Virtual reality (VR) is a big-buzz topic in gaming today, but a lot of questions remain about what direction the genre will take, and how it will evolve. Attendees of the upcoming GamesBeat Summit 2017 will get a highly educated perspective on VR’s future thanks to a presentation from Kim Pallister, director of the Intel® Virtual Reality Center of Excellence in Oregon.

The Virtual Reality Center is part of the Intel Client Computing Group, which, according to Pallister, “drives the business of selling Intel silicon and solutions into PCs — desktop and notebook PCs — the bread-and-butter business for us.” Within that overarching mission, Pallister and his team focus on how PCs will handle VR applications, and still provide users with the best performance possible.

“It’s up to us to understand what we need to be doing to these PCs over time; what we need to do to our roadmap, and to the PCs that come out, as VR becomes another usage for this very versatile platform,” Pallister explains.

“Part of our role is looking at how requirements are affected,” he continues. “Part of it is working with partners like Valve*, HTC* and Oculus*, along with others, on where their roadmaps are going — and making sure that we’re aligned. Similarly, we’re working with Microsoft* on preparing for the Windows* mixed-reality effort they have coming, and the PC-connected headsets that they’re helping their partners bring to market.”

Image of Kim Pallister
Above: Intel’s Kim Pallister Image Credit: Intel

 

Pallister adds that the Virtual Reality Center is “doing research and development on various technologies to help move the industry forward.” That effort comes in different forms: through best-practice software techniques; sample apps and methods for getting the most out of the CPU; and improving the user experience — such as how a VR headset could be used wirelessly, so the user doesn’t need to be tethered to the PC.

From fiction to real life

The theme for GamesBeat Summit 2017 is “How games, sci-fi, and tech create real-world magic.” Pallister’s talk will be geared toward how VR — as well as augmented reality (AR) and mixed-reality experiences — will change in the near future, based on how hardware and software will change. He’ll also address what game developers will need to do to stay on the bleeding edge of this swiftly evolving technology, which is still in its embryonic stages.

“A lot of the talk has been about this intersection between science fiction and where the VR industry is heading,” Pallister says, “So I’m looking at what the potential technologies are on the near-to-medium time horizon — not just from Intel, but from the industry at large — and what they might mean to the content and experiences that get developed there. I’m also looking at what some of the challenges will be in designing those experiences — in terms of game design, and how to steer the user experience. There’s a pretty rich vein of conversation that can be had there.

“Everybody in both the hardware and software spaces is learning as they iterate — there’s a lot of rapid evolution — and some of these technologies will take time for people to figure out how to wield those tools in effective ways.”

Who’s driving?

Pallister notes that the PC does particularly well when it’s the center of a fresh, rapidly evolving category, such as VR is now. This fast pace can create a chaotic situation at times, with numerous companies and individuals driving innovation, and trying to forge their way through this somewhat uncharted territory. Out of such chaos, however, can come a sense of order — and it’s order driven not by one self-designated, perhaps restrictive authority figure, with everyone else being forced to play “follow the leader”, but by discovery, progress, and a sense of community (even if there’s ultimately competition among those community members).

“Especially in an early space like VR, one of the advantages that the PC platform brings is that it’s an open ecosystem,” Pallister says. “In a space where nobody knows what the future holds, you’re far better off where lots of people can make different choices and different bets, and try different things, as opposed to having a single vendor that says: ‘We will decide what the future is, and you will all follow us.’ ”

But who will be the “lots of people” that Pallister says will push the VR Revolution?

“Not just Intel, but the players in the industry — including the vast majority of hardware and platforms players — all recognize that the developers are going to be the ones figuring out a lot of this stuff. And so the more we can give them flexibility, and give them tools to work with, the more they’re going to help guide us on this path.”

Managing Amazon Greengrass Core Devices Remotely with Wind River Helix* Device Cloud


IoT devices come in many flavors these days, from generic gateways to specialized devices. Using Intel® IoT Gateway Technology, Ubuntu* 16, and Wind River Helix* Device Cloud (HDC), remote management of your IoT system just became simple. There are many cloud service providers to choose from these days; Amazon has recently released a new IoT solution that supports Intel® IoT Gateway Technology, called Amazon Greengrass Core. This tutorial shows a method to restart your Amazon Greengrass Core device remotely using HDC.

Prerequisites

  1. Install Ubuntu 16 (https://help.ubuntu.com/16.04/installation-guide/)
  2. Sign up for an Amazon Web Services (AWS)* Account (https://aws.amazon.com/)
  3. Install Amazon Greengrass Core (http://docs.aws.amazon.com/greengrass/latest/developerguide/gg-gs.html)
  4. Sign up for a HDC Trial Account (https://www.windriver.com/evaluations/)
  5. Download HDC Agent  (https://windshare.windriver.com/)
  6. Install HDC Agent (https://knowledge.windriver.com/en-us/000_Products/040/050/020/000_Wind_River_Helix_Device_Cloud_Getting_Started/060/000)

Tutorial

1. Log in to the HDC portal at https://www.helixdevicecloud.com and select the device.

2. Remotely log in to the Ubuntu 16 device.

3. Stop and start Amazon Greengrass Core.
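For step 3, a typical sequence from the remote shell looks like the sketch below. The installation path is an assumption based on a default Greengrass Core install; adjust it to match your system:

cd /greengrass/ggc/core
sudo ./greengrassd stop
sudo ./greengrassd start
ps aux | grep -i greengrass   # confirm the daemon came back up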

Summary

This tutorial demonstrated how to restart an Amazon Greengrass Core Device remotely using Intel® IoT Gateway Technology and Helix Device Cloud.  Now it is easy to manage your IoT solutions after deployment.

About the Author

Mike Rylee is a Software Engineer at Intel Corporation with a background in developing embedded systems and apps for Android*, Windows*, iOS*, and Mac*.  He currently works on Internet of Things projects.

 

Development Strategy Turns Players into Robot Builders


The original article is published by Intel Game Dev on VentureBeat*: Freejam Games’ development strategy turns players into robot builders. Get more game dev news and related topics from Intel on VentureBeat.

 Freejam Games

Presented by Intel

Image of Mark Simmons CEO of Freejam Games
Above: Freejam’s CEO/Game Director Mark Simmons Image Credit: Freejam

 

A few years ago, Mark Simmons was toiling away at game development jobs at a work-for-hire contract studio. He enjoyed that he was working with friends, but he wasn’t excited by the restrictions imposed by such a situation: tight budgets, tough time schedules, and, most of all, that he was working on other people’s games.

On the side, he started working on a prototype that centered on the ability for players to contribute to the project via user-generated content (UGC). Inspired by Eric Ries’ book The Lean Startup, Simmons felt that a studio could be founded with just a few good people doing the main work, aided by UGC. He believed that UGC would provide “the means to allow a small developer to make games that were much larger than the small group could make on their own — to harness the power of the community.”

Simmons’ physics degree led him to put together a prototype of blocks — with inspiration from what Minecraft* accomplished — that could be placed together and would interact with the world properly. That prototype became a demo he played with his development friends, and an investor friend eventually got involved with a helping hand.

“[He] gave us this opportunity to build our own company,” Simmons says. “We all quit our jobs and formed this new company on the premise that if we weren’t successful in 18 months, it’d be dead.”

So, in April 2013, Freejam was born in Portsmouth, UK, with Simmons as the CEO/Game Director, and four developer friends making up the rest of the team. The basic prototype Simmons had put together provided the foundation for what they’d be working on going forward.

Share and share alike

From there, the group kept building onto the product and adding more functionality. Initially, there was the ability to connect blocks together, put wheels on the whole thing, and then drive it around a small area. Then the team added the ability to pick up green crystals that served as a form of currency, which could be used to buy more items.

Most studios work toward constructing a finished product before they endeavor to sell it to consumers, but Freejam’s intent was for players to create content that would make for a bigger game. That led them to release the prototype to the world to get feedback, and build up a community.

“We wanted to learn as much as we could learn,” Simmons says, “and we felt like we would learn more if we were bold and just put it out there in a raw form to develop it with the community.”

Screenshot of a battle bot being built in a 3d environment
Above: Building a battle bot Image Credit: Freejam

 

The concept, Simmons says, was to “build, measure, learn” by getting the game into people’s hands, analyzing the subsequent data, implementing new features, releasing the update to the community, analyzing the data… Lather, rinse, repeat. It wasn’t making money for them, as it was a free-to-play project, but, at the same time, Simmons says they didn’t have the strategy to build a complete product and expect it to be a blockbuster hit.

“We think it’s crazy to spend three years working on a title, and then launch it — and then hope that it’s good,” Simmons explains. “Obviously, for some developers that works really well, and there are some huge success stories with that approach. But as an indie, you haven’t got the brute-force money backing you to be able to make, say, an Overwatch, where it’s just so beautiful and so polished and so amazing that it’s just better than everything else.

“So, what you’ve got to do is innovate, and if you’re innovating, you’re trying to do something fundamentally different from everyone else. And there’s an inherent risk in trying to do something different from everyone else, because your idea may just suck and the audience may not go for it.”

Luckily, that wasn’t an issue. Simmons’s prototype became Robocraft, which itself became a much bigger, feature-filled shoot-’em-up. It’s still essentially a free-to-play product, but with in-app purchases — such as the ability to buy salvage crates to get more items, or a “membership” that brings some benefits. The benefits, however, won’t make you ultra-powerful, causing an imbalance among players in the community.

“We tried really hard to make sure the game is not pay-to-win in any way, and it’s fair on the monetization side, so the prices are honest, and, ultimately, anyone who’s playing the game for free can get everything within the game in a reasonable amount of time,” Simmons says. “We try to make sure it’s pretty fair.”

Make or break

Freejam is like any other developer in that it has faced — and continues to face — issues around producing its game. Simmons notes, for instance, that the varied, regularly changing PC specs are a constant challenge. It’s tough to make a game that’ll satisfactorily play on everyone’s computer.

 Variety of robots team up to shot at adversary in a 3d environment
Above: Teaming up in the third-person-shooter Robocraft Image Credit: Freejam

 

Also, while Robocraft’s ongoing iteration and revision means the game continues to grow (a good thing), sometimes a change that’s made doesn’t sit well with everyone in the game’s player base (a potentially bad thing).

“We’ve always been very open to changing the game if we feel a part of it is not working,” Simmons says. “Inevitably, you get some players that love the game the way it was, and where you’re constantly changing the game in fairly significant ways — and we’ve probably changed our game much more than most would after its launch — that comes with a certain amount of friction within the existing community. They get tired of the change, or resist the change.

“You get this constant tension. [On one hand, you have] new players who are coming at it for the first time — and [you’re seeing] it’s a better game, because they’re hanging around for longer and they’re telling more of their friends and leaving more positive reviews. On the other hand, you have this older group of users who’ve been playing it since Day One, and they remember a certain point in time, which was their favorite point-in-time with the development, and the change that’s been made isn’t a good one.”

It’s a battle that developers regularly need to fight: Do you add a new feature or alter the game for what you think is the better, at the risk of upsetting your existing base of long-time players? Or do you always cater to the veterans, running the risk that you might make it harder for newbies to engage with your game? Fortunately, Freejam — which has grown from its original five developers to a staff of 40 now — has 12-million registered players, amassed over the last three-plus years, to give the studio the vital feedback needed to make the right choices.

Image of the July 2016 Freejam team standing on a dock by the water
Above: The Freejam team has grown from its original five Image Credit: Freejam

Configure Open vSwitch* with Data Plane Development Kit on Ubuntu Server* 17.04


Overview

In this article, we will be configuring Open vSwitch* with Data Plane Development Kit (OVS-DPDK) on Ubuntu Server* 17.04. With the new release of this package, OVS-DPDK has been updated to use the latest release of both the DPDK (v16.11.1) and Open vSwitch (v2.6.1) projects. We took it for a test drive and were impressed with how seamless and easy it is to use OVS-DPDK on Ubuntu*.

We configured OVS-DPDK with two vhost-user ports and allocated them to two virtual machines (VMs). We then ran a simple iperf3* test case. The following diagram captures the setup.


Test-Case Configuration

Installing OVS-DPDK using Advanced Packaging Tool* (APT*)

To install OVS-DPDK on our system, run the following commands. Also, we will update ovs-vswitchd to use the ovs-vswitchd-dpdk package.

sudo apt-get install openvswitch-switch-dpdk
sudo update-alternatives --set ovs-vswitchd /usr/lib/openvswitch-switch-dpdk/ovs-vswitchd-dpdk

Then restart the ovs-vswitchd service with the following command to use the DPDK:

sudo systemctl restart openvswitch-switch.service

Configuring Ubuntu Server* 17.04 for OVS-DPDK

The system we are using in this demo is a 2-socket, 22 cores per socket, Intel® Hyper-Threading Technology (Intel® HT Technology) enabled server, giving us 88 logical cores total. The CPU model used is an Intel® Xeon® CPU E5-2699 v4 @ 2.20GHz. To configure Ubuntu for optimal use of OVS-DPDK, we will change the GRUB* command-line options that are passed to Ubuntu at boot time for our system. To do this we will edit the following config file:

/etc/default/grub

Change the setting GRUB_CMDLINE_LINUX_DEFAULT to the following:
 GRUB_CMDLINE_LINUX_DEFAULT="default_hugepagesz=1G hugepagesz=1G hugepages=16 hugepagesz=2M hugepages=2048 iommu=pt intel_iommu=on isolcpus=1-21,23-43,45-65,67-87"

This makes GRUB aware of the new options to pass to Ubuntu during boot time. We set isolcpus so that the Linux* scheduler would only run on two physical cores. Later, we will allocate the remaining cores to the DPDK. Also, we set the number of pages and page size for hugepages. For details on why hugepages are required, and how they can help to improve performance, please see the explanation in the Getting Started Guide for Linux on dpdk.org.

Note: The isolcpus setting varies depending on how many cores are available per CPU.

Also, we will edit /etc/dpdk/dpdk.conf to specify the number of hugepages to reserve on system boot. Uncomment and change the setting NR_1G_PAGES to the following:

NR_1G_PAGES=8

Depending on your system memory size, you may increase or decrease the number of 1G pages.

After both files have been updated run the following commands:

sudo update-grub
sudo reboot

A reboot will apply the new settings. Also, during the boot, enter the BIOS and enable:

- Intel® Virtualization Technology (Intel® VT-x)

- Intel® Virtualization Technology (Intel® VT) for Directed I/O (Intel® VT-d)

Once logged back into your Ubuntu session we will create a mount path for our hugepages:

sudo mkdir -p /mnt/huge
sudo mkdir -p /mnt/huge_2mb
sudo mount -t hugetlbfs none /mnt/huge
sudo mount -t hugetlbfs none /mnt/huge_2mb -o pagesize=2MB
sudo mount -t hugetlbfs none /dev/hugepages
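Optionally, to make these hugepage mounts persist across reboots, entries along the following lines can be added to /etc/fstab (a sketch assuming the same mount points as above):

nodev /mnt/huge hugetlbfs pagesize=1GB 0 0
nodev /mnt/huge_2mb hugetlbfs pagesize=2MB 0 0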

To ensure that the changes are in effect, run the commands below:

grep HugePages_ /proc/meminfo
cat /proc/cmdline

If the changes took place, your output from the above commands should look similar to the image below:

Configuring OVS-DPDK Settings

To initialize the ovs-vsctl database (a one-time step), run the command ‘sudo ovs-vsctl --no-wait init’. The OVS database will contain user-set options for OVS and the DPDK. To pass arguments to the DPDK, we will use the command-line utility as follows:

‘sudo ovs-vsctl set Open_vSwitch . <argument>’.

Additionally, the OVS-DPDK package relies on the following config files:

    /etc/dpdk/dpdk.conf – Configures hugepages

    /etc/dpdk/interfaces – Configures/assigns network interface cards (NICs) for DPDK use
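As a sketch, a persistent binding entry in /etc/dpdk/interfaces might look like the line below; the PCI address is an assumption, so check your own device addresses with dpdk-devbind --status:

pci 0000:04:00.0 vfio-pci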

For more information on OVS-DPDK, unzip the following files:

  • /usr/share/doc/openvswitch-common/INSTALL.DPDK.md.gz – OVS DPDK install guide
  • /usr/share/doc/openvswitch-common/INSTALL.DPDK-ADVANCED.md.gz – Advanced OVS DPDK install guide

Next, we will configure OVS to use DPDK with the following command:

sudo ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true

Once the OVS is set up to use DPDK, we will change one OVS setting, two important DPDK configuration settings, and bind our NIC devices to the DPDK.

DPDK Settings

  • dpdk-lcore-mask: Specifies the CPU cores on which dpdk lcore threads should be spawned. A hex string is expected.
  • dpdk-socket-mem: Comma-separated list of memory to preallocate from hugepages on specific sockets.

OVS Settings

  • pmd-cpu-mask: PMD (poll-mode driver) threads can be created and pinned to CPU cores by explicitly specifying pmd-cpu-mask. These threads poll the DPDK devices for new packets instead of having the NIC driver send an interrupt when a new packet arrives.

The following commands are used to configure these settings:

sudo ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-lcore-mask=0xfffffbffffefffffbffffe
sudo ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem="1024,1024"
sudo ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=1E0000000001E

For dpdk-lcore-mask we used a mask of 0xfffffbffffefffffbffffe to specify the CPU cores on which dpdk-lcore should spawn. In our system, we have the dpdk-lcore threads spawn on all cores except cores 0, 22, 44, and 66. Those cores are reserved for the Linux scheduler. Similarly, for the pmd-cpu-mask, we used the mask 1E0000000001E to spawn four pmd threads for non-uniform memory access (NUMA) Node 0, and another four pmd threads for NUMA Node 1. Lastly, since we have a two-socket system, we allocate 1 GB of memory per NUMA Node; that is, “1024, 1024”. For a single-socket system, the string would just be “1024”.
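To confirm that these options were stored as intended, the database entries can be read back with:

sudo ovs-vsctl get Open_vSwitch . other_config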

Creating OVS-DPDK Bridge and Ports

For our sample test case, we will create a bridge and add two DPDK vhost-user ports. To create an OVS bridge and two DPDK ports, run the following commands:

sudo ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
sudo ovs-vsctl add-port br0 vhost-user1 -- set Interface vhost-user1 type=dpdkvhostuser
sudo ovs-vsctl add-port br0 vhost-user2 -- set Interface vhost-user2 type=dpdkvhostuser

To ensure that the bridge and vhost-user ports have been properly set up and configured, run the command:

sudo ovs-vsctl show

If all is successful you should see output like the image below:

Binding Devices to DPDK

To bind your NIC device to the DPDK, you must run the dpdk-devbind command. For example, to unbind eth1 from its current driver and bind it to the vfio-pci driver, run: dpdk-devbind --bind=vfio-pci eth1. To use the vfio-pci driver, run modprobe to load it and its dependencies.

This is what it looked like on my system, with 4 x 10 Gb interfaces available:

sudo modprobe vfio-pci
sudo dpdk-devbind --bind=vfio-pci ens785f0
sudo dpdk-devbind --bind=vfio-pci ens785f1
sudo dpdk-devbind --bind=vfio-pci ens785f2
sudo dpdk-devbind --bind=vfio-pci ens785f3

To check whether the NIC cards you specified are bound to the DPDK, run the command:

sudo dpdk-devbind --status

If all is correct, you should have an output similar to the image below:

Using DPDK vhost-user Ports with VMs

Creating VMs is out of the scope of this document. Once we have two VMs created (in this example, virtual disks us17_04vm1.qcow2 and us17_04vm2.qcow2), the following commands show how to use the DPDK vhost-user ports we created earlier.

Ensure that the QEMU* version on the system is v2.2.0 or above, as discussed under “DPDK vhost-user Prerequisites” in the OVS DPDK INSTALL GUIDE on https://github.com/openvswitch.

sudo qemu-system-x86_64 -m 1024 -smp 4 -cpu host -hda /home/user/us17_04vm1.qcow2 -boot c -enable-kvm -no-reboot -net none -nographic \
-chardev socket,id=char1,path=/run/openvswitch/vhost-user1 \
-netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce \
-device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1 \
-object memory-backend-file,id=mem,size=1G,mem-path=/dev/hugepages,share=on -numa node,memdev=mem -mem-prealloc \
-virtfs local,path=/home/user/iperf_debs,mount_tag=host0,security_model=none,id=vm1_dev
sudo qemu-system-x86_64 -m 1024 -smp 4 -cpu host -hda /home/user/us17_04vm2.qcow2 -boot c -enable-kvm -no-reboot -net none -nographic \
-chardev socket,id=char2,path=/run/openvswitch/vhost-user2 \
-netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce \
-device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2 \
-object memory-backend-file,id=mem,size=1G,mem-path=/dev/hugepages,share=on -numa node,memdev=mem -mem-prealloc \
-virtfs local,path=/home/user/iperf_debs,mount_tag=host0,security_model=none,id=vm2_dev

DPDK vhost-user inter-VM Test Case with iperf3*

In the previous step, we configured two VMs, each with a Virtio* NIC that is connected to the OVS-DPDK bridge.

Configure the NIC IP address on both VMs to be on the same subnet. Install iperf3 from http://software.es.net/iperf, and then run a simple network test case. On one VM, start iperf3 in server mode (iperf3 -s), and run the iperf3 client on the other VM (iperf3 -c server_ip), as shown below. Network throughput and performance vary depending on your system hardware capabilities and configuration.
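For example, with a placeholder server address (substitute the IP you assigned to the server VM):

# On the first VM (server)
iperf3 -s

# On the second VM (client), run a 60-second test against the server VM
iperf3 -c 192.168.1.10 -t 60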

OVS Using DPDK

OVS Without DPDK

From the above images, we observe that the OVS-DPDK transfer rate is roughly ~2.5x greater than OVS without DPDK.

Summary

Ubuntu has standard packages available for using OVS-DPDK. In this article, we discussed how to install, configure, and use this package for enhanced network throughput and performance. We also covered how to configure a simple OVS-DPDK bridge with DPDK vhost-user ports for an inter-VM application use case. Lastly, we observed that OVS with DPDK gave us a ~2.5x greater transfer rate than OVS without DPDK on a simple inter-VM test case on our system.

About the Author

Yaser Ahmed is a software engineer at Intel Corporation who has an MS degree in Applied Statistics from DePaul University and a BS degree in Electrical Engineering from the University of Minnesota.


CPUs are set to dominate high end visualization


It is certainly provocative to say that CPUs will dominate any part of visualization - but I say it with confidence because the data supports why this is happening. The primary drivers are (1) data sizes, (2) minimizing data movement, and (3) the ability to switch to O(n log n) algorithms. Couple that with the ultra-hot topic of "Software Defined Visualization" that makes these three things possible - and you have a lot to consider about how the world is changing.

Of course, what is "high end" today often becomes common place over time... so this trend may affect us all eventually.  It's at least worth understanding the elements at play.

At ISC17 in Germany this week (June 19-21), Intel is demoing (and selling) its vision of a “dream machine” for doing software defined visualization, with a special eye toward in situ visualization development. Jim Jeffers, Intel, and friends are demonstrating it at ISC'17, and they will be at SIGGRAPH'17 too. The "dream machine" can support visualization of data sets up to 1.5TB in size. They designed it to address the needs of the scientific visualization and professional rendering markets.

Photo credit (above): Asteroid Deep Water Impact Analysis; Data Courtesy: John Patchett, Galen Glisner per Los Alamos National Laboratory tech report LA-UR-17-21595. Visualization: Carson Brownlee, Intel.

With Jim's help, I wrote an article with more information about how CPUs now offer higher performance and lower cost than competing GPU-based solutions for the largest visualization tasks. The full article is posted at the TechEnablement site.

In the full article, aside from writing about the trend, I provide links to technical papers that show this trend toward CPUs as the preferred solution for visualization of large data (really, really big), as well as links to conferences and links about the "visualization dream machine" (how I describe it, not what Intel calls it officially).

Dream Machine for Software Defined Visualization

Photo: Intel/Colfax Visualization "Dream" Machine

Intel® Software Guard Extensions (Intel® SGX) Part 9: Power Events and Data Sealing


Download [ZIP 598KB]

In part 9 of the Intel® Software Guard Extensions (Intel® SGX) tutorial series we’ll address some of the complexities surrounding the suspend and resume power cycle. Our application needs to do more than just survive power transitions: it must also provide a smooth user experience without compromising overall security. First, we’ll discuss what happens to enclaves when the system resumes from the sleep state and provide general advice on how to manage power transitions in an Intel SGX application. We’ll examine the data sealing capabilities of Intel SGX and show how they can help smooth the transitions between power states, while also pointing out some of the serious pitfalls that can occur when they are used improperly. Finally, we’ll apply these techniques to the Tutorial Password Manager in order to create a smooth user experience.

You can find a list of all the published tutorials in the article Introducing the Intel® Software Guard Extensions Tutorial Series.

Source code is provided with this installment of the series.

Suspend, Hibernate, and Resume

Applications must be able to survive a sleep and resume cycle. When the system resumes from suspend or hibernation, applications should return to their previous state, or, if necessary, create a new state specifically to handle the wake event. What applications shouldn’t do is become unstable or crash as a direct result of that change in the power state. Call this the “rule zero” of managing power events.

Most applications don’t actually need special handling for these events. When the system suspends, the application state is preserved because RAM is still powered on. When the system hibernates, the RAM is saved to a special hibernation file on disk, which is used to restore the system state when it’s powered back on. You don’t need to add code to enable or take advantage of this core feature of the OS. There are two notable exceptions, however:

  • Applications that rely on physical hardware that isn’t guaranteed to be preserved across power events, such as CPU caches.
  • Scenarios where possible changes to the system context can affect program logic. For example, a location-based application can be moved hundreds of miles while it’s sleeping and would need to re-acquire its location. An application that works with sensitive data may choose to guard against theft by reprompting the user for his or her password.

Our Tutorial Password Manager actually falls into both categories. Certainly, if a laptop running our password manager were stolen, the thief would potentially have access to the victim's passwords unless the victim had explicitly closed the application or locked the vault before the theft. The first category, though, may be less obvious: Intel SGX is a hardware feature whose state is not preserved across power events.

We can demonstrate this by running the Tutorial Password Manager, unlocking the vault, suspending the system, waking it back up, and then trying to read a password or edit one of the accounts. Follow that sequence, and you'll get one of the error dialogs shown in Figure 1 or Figure 2.

Figure 1. Error received when attempting to edit an account after resuming from sleep.

Figure 2. Error received when attempting to view an account password after resuming from sleep.

As currently written, the Tutorial Password Manager violates rule zero: it becomes unstable after resuming from a sleep operation. The application needs special handling for power events.

Enclaves and Power Events

When a processor leaves S0 or S1 for a lower-power state, the enclave page cache (EPC) is destroyed: all EPC pages are erased along with their encryption keys. Since enclaves store their code and data in the EPC, when the EPC goes away the enclaves go with it. This means that enclaves do not survive power events that take the system to state S2 or lower.

Table 1 provides a summary of the power states.

Table 1. CPU power states

  • S0: Active run state. The CPU is executing instructions, and background tasks are running even if the system appears idle and the display is powered off.
  • S1: Processor caches are flushed and the CPU stops executing instructions. Power to the CPU and RAM is maintained. Devices may or may not power off. This is a high-power standby state, sometimes called “power on suspend.”
  • S2: The CPU is powered off. CPU context and the contents of the system cache are lost.
  • S3: RAM is powered on to preserve its contents. A standby or sleep state.
  • S4: RAM is saved to nonvolatile storage in a hibernation file before powering off. When powered on, the hibernation file is read in to restore the system state. A hibernation state.
  • S5: “Soft off.” The system is off, but some components are powered to allow a full system power-on via some external event, such as Wake-on-LAN, a system management component, or a connected device.

Power state S1 is not typically seen on modern systems, and state S2 is uncommon in general. Most CPUs go to power state S3 when put in “sleep” mode and drop to S4 when hibernating to disk.

The Windows* OS provides a mechanism for applications to subscribe to wakeup events, but that won’t help any ECALLs that are in progress when the power transition occurs (and, by extension, any OCALLs either, since they are launched from inside ECALLs). When the enclave is destroyed, the execution context for the ECALL is destroyed with it, any nested OCALLs and ECALLs are destroyed, and the outermost ECALL immediately returns with a status of SGX_ERROR_ENCLAVE_LOST.

It is important to note that any OCALLs that are in progress are destroyed without warning, which means any changes they are making in unprotected memory will potentially be incomplete. Since unprotected memory is maintained or restored when resuming from the S3 and S4 power states, it is important that developers use reliable and robust procedures to prevent partial write corruptions. Applications must not end up in an indeterminate or invalid state when power resumes.

General Advice for Managing Power Transitions

Planning for power transitions begins before a sleep or hibernation event occurs. Decide how extensive the enclave recovery needs to be. Should the application be able to pick up exactly where it left off without user intervention? Will it resume interrupted tasks, restart them, or just abort? Will the user interface, if any, reflect the change in state? The answers to these questions will drive the rest of the application design. As a general rule, the more autonomous and seamless the recovery is, the more complex the program logic will need to be.

An application may also have different levels of recovery at different points. Some stages of an application may be easier to seamlessly recover from than others, and in some execution contexts it may not make sense or even be good security practice to attempt a seamless recovery at all.

Once the overall enclave recovery strategy has been identified, the process of preparing an enclave for a power event is as follows:

  1. Determine the minimal state information and data that needs to be saved in order to reconstruct the enclave.
  2. Periodically seal the state information and save it to unprotected memory (data sealing is discussed below). The sealed state data can be sent back to the main application as an [out] pointer parameter to an ECALL, or the ECALL can make an OCALL specifically to save state data.
  3. When an SGX_ERROR_ENCLAVE_LOST code is returned by an ECALL, explicitly destroy the enclave and then recreate it. It is strongly recommended that applications explicitly destroy the enclave with a call to sgx_enclave_destroy().
  4. Restore the enclave state using an ECALL that is designed to do so.

It is important to save the enclave state to untrusted memory before a power transition occurs. Even if the OS is able to send an event to an application when it is about to enter a standby mode, there are no guarantees that the application will have sufficient time to act before the system physically goes to sleep.
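As a concrete illustration of steps 3 and 4 above, the following is a minimal sketch of how the untrusted side of an application might wrap an ECALL with recovery logic. The ECALL names ve_do_work and ve_restore_state, the enclave image name "Enclave.signed.dll", and the sealed-state buffer are hypothetical placeholders, not part of the Tutorial Password Manager.

// Untrusted-code recovery sketch. ve_do_work, ve_restore_state, and the
// enclave image name are hypothetical.
#include <sgx_urts.h>
#include "Enclave_u.h"   // untrusted proxy header generated by sgx_edger8r (assumed name)

sgx_status_t call_with_recovery(sgx_enclave_id_t *eid,
                                uint8_t *sealed_state, uint32_t sealed_sz)
{
	int retval = 0;
	sgx_status_t status = ve_do_work(*eid, &retval);

	if (status == SGX_ERROR_ENCLAVE_LOST) {
		// The EPC was destroyed by a power event. Explicitly destroy the
		// lost enclave, recreate it, restore its state, and retry once.
		sgx_destroy_enclave(*eid);

		sgx_launch_token_t token = { 0 };
		int updated = 0;
		status = sgx_create_enclave("Enclave.signed.dll", 1 /* debug launch; 0 for release */,
			&token, &updated, eid, NULL);
		if (status != SGX_SUCCESS) return status;

		// Push the previously sealed state back into the new enclave instance.
		status = ve_restore_state(*eid, &retval, sealed_state, sealed_sz);
		if (status != SGX_SUCCESS) return status;
		if (retval != 0) return SGX_ERROR_UNEXPECTED;   // the enclave rejected the state

		status = ve_do_work(*eid, &retval);
	}
	return status;
}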

Data Sealing

When an enclave needs to preserve data across instantiations, either in preparation for a power event or between executions of the parent application, it needs to send that data out to untrusted memory. The problem with untrusted memory, however, is exactly that: it is untrusted. It is neither encrypted nor integrity checked, so any data sent outside the enclave in the clear is potentially leaking secrets. Furthermore, if that data were to be modified in untrusted memory, future instantiations of the enclave would not be able to detect that the modification occurred.

To address this problem, Intel SGX provides a capability called data sealing. When data is sealed, it is encrypted with advanced encryption standard (AES) in Galois/Counter Mode (GCM) using a 128-bit key that is derived from CPU-specific key material and some additional inputs, guided by one of two key policies. The use of AES-GCM provides both confidentiality of the data being sealed and integrity checking when the data is read back in and unsealed (decrypted).

As mentioned above, the key used in data sealing is derived from several inputs. The two key policies defined by data sealing determine what those inputs are:

  • MRSIGNER. The encryption key is derived from the CPU’s key material, the security version number (SVN), and the enclave signing key used by the developer. Data sealed using MRSIGNER can be unsealed by other enclaves on that same system that originate from the same software vendor (enclaves that share the same signing key). The use of an SVN allows enclaves to unseal data that was sealed by previous versions of an enclave, but prevents older enclaves from unsealing data from newer versions. It allows enclave developers to enforce software version upgrades.
  • MRENCLAVE. The encryption key is derived from the CPU’s key material and the enclave’s measurement (a cryptographic hash of the enclave’s contents). Data sealed using the MRENCLAVE policy can only be unsealed by that exact enclave on that system.

Note that the CPU’s key material is a common component in both key policies. Each processor has some random, hardware-based key material (physical circuitry on the processor) that is built into it as part of the manufacturing process. This ensures that data sealed by an enclave on one CPU cannot be unsealed by enclaves on another CPU. Each CPU will produce a different sealing key, even if all other inputs to the key derivation (enclave measurement, enclave signing key, SVN) are the same.

The data sealing and unsealing API is really a set of convenience functions. They provide a high-level interface to the underlying AES-GCM encryption and 128-bit key derivation functions.

Once data has been sealed in the enclave, it can be sent out to untrusted memory and optionally written to disk.
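To make the shape of these convenience functions concrete, here is a minimal sketch of a seal/unseal round trip as it might appear in trusted (enclave) code. The buffers and function names are hypothetical, and error handling is reduced to returning the SGX status code.

// Trusted (enclave) code sketch. Buffer and function names are hypothetical.
#include <sgx_tseal.h>
#include <stdint.h>

sgx_status_t seal_secret(const uint8_t *secret, uint32_t secret_len,
                         uint8_t *sealed, uint32_t sealed_len)
{
	// A sealed blob is larger than the plaintext: it carries a header and MAC.
	uint32_t need = sgx_calc_sealed_data_size(0, secret_len);
	if (need == UINT32_MAX || need > sealed_len) return SGX_ERROR_INVALID_PARAMETER;

	// sgx_seal_data() uses the SDK's default key policy; sgx_seal_data_ex()
	// lets the caller request MRENCLAVE or MRSIGNER explicitly.
	return sgx_seal_data(0, NULL, secret_len, secret,
		need, (sgx_sealed_data_t *)sealed);
}

sgx_status_t unseal_secret(const uint8_t *sealed, uint8_t *secret, uint32_t *secret_len)
{
	// The plaintext length is recorded in the sealed blob itself.
	uint32_t len = sgx_get_encrypt_txt_len((const sgx_sealed_data_t *)sealed);
	if (len == UINT32_MAX || len > *secret_len) return SGX_ERROR_INVALID_PARAMETER;

	*secret_len = len;
	return sgx_unseal_data((const sgx_sealed_data_t *)sealed, NULL, NULL,
		secret, secret_len);
}

If the sealed blob has been tampered with in untrusted memory, sgx_unseal_data fails the AES-GCM integrity check and returns an error rather than handing back corrupted plaintext.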

Caveats

There is a caveat with data sealing, though, and it has significant security implications. Your enclave API needs to include an ECALL that will take sealed data as an input and then unseal it. However, Intel SGX does not authenticate the calling application, so you cannot assume that only your application is loading your enclave. This means that your enclave can be loaded and executed by anyone, even applications you didn’t write. As you might recall from Part 1, enclave applications are divided into two parts: the trusted part, which is made up of the enclaves, and the untrusted part, which is the rest of the application. These terms, “trusted” and “untrusted,” are chosen deliberately.

Intel SGX cannot authenticate the calling application because this would require a trusted execution chain that runs from system power-on all the way through boot, the OS load, and launching the application. This is far outside the scope of Intel SGX, which limits the trusted execution environment to just the enclaves themselves. Because there’s no way for the enclave to validate the caller, each enclave must be written defensibly. Your enclave cannot make any assumptions about the application that has called into it. An enclave must be written under the assumption that any application can load it and execute its API, and that its ECALLs can be executed in any order.

Normally this is not a significant constraint, but sealing and unsealing data complicates matters significantly because both the sealed data and the means to unseal it are exposed to arbitrary applications. The enclave API must not allow applications to use sealed data to bypass security mechanisms.

Take the following scenario as an example: A file encryption program wants to save end users the hassle of re-entering their password every time the application runs, so it seals their password using the data sealing functions and the MRENCLAVE policy, and then writes the sealed data to disk. When the application starts, it looks for the sealed data file, and if it’s present, reads it in and makes an ECALL to unseal the data and restore the user’s password into the enclave.

The problems with this hypothetical application are two-fold:

  • It assumes that it is the only application that will ever load the enclave.
  • It doesn’t authenticate the end user when the data is unsealed.

A malicious software developer can write their own application that loads the same enclave and follows the same procedure (looks for the sealed data file, and invokes the ECALL to unseal it inside the enclave). While the malicious application can’t expose the user’s password, it can use the enclave’s ECALLs to encrypt and decrypt the user’s files using their stored password, which is nearly as bad. The malicious user has gained the ability to decrypt files without having to know the user’s password at all!

A non-Intel SGX version of this same application that offered this same convenience feature would also be vulnerable, but that’s not the point. If the goal is to use Intel SGX features to harden the application’s security, those same features should not be undermined by poor programming practices!

Managing Power Transitions in the Tutorial Password Manager

Now that we understand how power events affect enclaves and know what tools are available to assist with the recovery process, we can turn our attention to the Tutorial Password Manager. As currently written, it has two problems:

  • It becomes unstable after a power event.
  • It assumes the password vault should remain unlocked after the system resumes.

Before we can solve the first problem we need to address the second one, and that means making some design decisions.

Sleep and Resume Behavior

The big decision that needs to be made for the Tutorial Password Manager is whether or not to lock the password vault when the system resumes from a sleep state.

The primary argument for locking the password vault after a sleep/resume cycle is to protect the password database in case the physical system is stolen while it’s suspended. This would prevent the thief from being able to access the password database after waking up the device. However, having the system lock the password vault immediately can also be a point of user interface friction: sometimes, aggressive power management settings cause a running system to sleep while the user is still in front of the device. If the user wakes the system back up immediately, they might be irritated to find that their password vault has been locked.

This issue really comes down to balancing user convenience against security, so the right approach is to give the user control over the application’s behavior. The default will be for the password vault to lock immediately upon suspend/resume, but the user can configure the application to wait up to 10 minutes after the sleep event before the vault is forcibly locked.

Intel® Software Guard Extensions and Non-Intel Software Guard Extensions Code Paths

Interestingly, the default behavior of the Intel SGX code path differs from that of the non-Intel SGX code path. Enclaves are destroyed during the sleep/resume cycle, which means that we effectively lock the password vault as a result. To give the user the illusion that the password vault never locked at all, we have to not only reload the vault file from disk, but also explicitly unlock it again without forcing the user to re-enter their password (this has some security implications, which we discuss below).

For the non-Intel SGX code path, the vault is just stored in regular memory. When the system resumes, system memory is unchanged and the application continues as normal. Thus, the default behavior is that an unlocked password vault remains unlocked when the system resumes.

Application Design

With the behavior of the application decided, we turn to the application design. Both code paths need to handle the sleep/resume cycle and place the vault in the correct state: locked or unlocked.

The Non-Intel Software Guard Extensions Code Path

This is the simpler of the two code paths. As mentioned above, the non-Intel SGX code path will, by default, leave the password vault unlocked if it was unlocked when the system went to sleep. When the system resumes it only needs to see how long it slept: if the sleep time exceeds the maximum configured by the user, the password vault should be explicitly locked.

To keep track of the sleep duration, we’ll need a periodic heartbeat that records the current time. This time will serve as the “sleep start” time when the system resumes. For security, the heartbeat time will be encrypted using the database key.

The Intel Software Guard Extensions Code Path

No matter how the application is configured, the system will need code to recreate the enclave and reopen the password vault. This will put the vault in the locked state.

The application will then need to see how long it has been sleeping. If the sleep time was less than the maximum configured by the user, the password vault needs to be explicitly unlocked without prompting the user for his or her master passphrase. In order to do that the application needs the passphrase, and that means the passphrase must be saved to untrusted memory so that it can be read back in when the system is restored.

The only safe way to save a secret to untrusted memory is to use data sealing, but this presents a significant security issue: As mentioned previously, our enclave can be loaded by any application, and the same ECALL that is used to unseal the master password will be available for anyone to use. Our password manager application exposes secrets to the end user (their passwords), and the master password is the only means of authenticating the user. The point of keeping the password vault unlocked after the sleep/resume cycle is to prevent the user from having to authenticate. That means we are creating a logic flow where a malicious user could potentially use our enclave’s API to unseal the user’s master password and then extract their account and password data.

In order to mitigate this risk, we’ll do the following:

  • Data will be sealed using the MRENCLAVE policy.
  • Sealed data will be kept in memory only. Writing it to disk would increase the attack surface.
  • In addition to sealing the password, we’ll also include the process ID. The enclave will require that the process ID of the calling process match the one that was saved when unsealing the data. If they don’t match, the vault will be left in the locked state.
  • The current system time will be sealed periodically using a heartbeat function. This will serve as the “sleep start” time.
  • The sleep duration will be checked in the enclave.

Note that verification logic must be in the enclave where it cannot be modified or manipulated.

This is not a perfect solution, but it helps. A malicious application would need to scrape the sealed data from memory, crash the user’s existing process, and then create new processes over and over until it gets one with the same process ID. It will have to do all of this before the lock timeout is reached (or take control of the system clock).

Common Needs

Both code paths will need some common infrastructure:

  • A timer to provide the heartbeat. We’ll use a timer interval of 15 seconds.
  • An event handler that is called when the system resumes from a sleep state.
  • Safe handling for any potential race conditions, since wakeup events are asynchronous.
  • Code that updates the UI to reflect the “locked” state of the password vault

Implementation

We won’t go over every change in the code base, but we’ll look at the major components and how they work.

User Options

The lock timeout value is set in the new Tools -> Options configuration dialog, shown in Figure 3.

Figure 3. Configuration options.

This parameter is saved immediately to the Windows registry under HKEY_CURRENT_USER and is loaded by the application on startup. If the registry value is not present, the lock timeout defaults to zero (lock the vault immediately after going to sleep).

The Intel SGX code path also saves this value in the enclave.

The Heartbeat

Figure 4 shows the declaration for the Heartbeat class which is ultimately responsible for recording the vault’s state information. The heartbeat is only run if state information is needed, however. If the user has set the lock timeout to zero, we don’t need to maintain state because we know to lock the vault immediately when the system resumes.

class PASSWORDMANAGERCORE_API Heartbeat {
	class PasswordManagerCoreNative *nmgr;
	HANDLE timer;
	void start_timer();
public:
	Heartbeat();
	~Heartbeat();
	void set_manager(PasswordManagerCoreNative *nmgr_in);
	void heartbeat();

	void start();
	void stop();
};

Figure 4. The Heartbeat class.

The PasswordManagerCoreNative class gains a Heartbeat object as a class member, and the Heartbeat object is initialized with a reference back to the containing PasswordManagerCoreNative object.

The Heartbeat class obtains a timer from CreateTimerQueueTimer and executes the callback function heartbeat_proc when the timer expires, as shown in Figure 5. The callback is passed a pointer to the Heartbeat object and calls its heartbeat method, which in turn calls the heartbeat method in PasswordManagerCoreNative and restarts the timer.

static void CALLBACK heartbeat_proc(PVOID param, BOOLEAN fired)
{
   // Call the heartbeat method in the Heartbeat object
	Heartbeat *hb = (Heartbeat *)param;
	hb->heartbeat();
}

Heartbeat::Heartbeat()
{
	timer = NULL;
}

Heartbeat::~Heartbeat()
{
	// Clean up the timer if one is still active.
	if (timer != NULL) DeleteTimerQueueTimer(NULL, timer, NULL);
}

void Heartbeat::set_manager(PasswordManagerCoreNative *nmgr_in)
{
	nmgr = nmgr_in;

}

void Heartbeat::heartbeat ()
{
	// Call the heartbeat method in the native password manager
	// object. Restart the timer unless there was an error.

	if (nmgr->heartbeat()) start_timer();
}

void Heartbeat::start()
{
	stop();

	// Perform our first heartbeat right away.

	if (nmgr->heartbeat()) start_timer();
}

void Heartbeat::start_timer()
{
	// Set our heartbeat timer. Use the default Timer Queue

	CreateTimerQueueTimer(&timer, NULL, (WAITORTIMERCALLBACK)heartbeat_proc,
		(void *)this, HEARTBEAT_INTERVAL_SECS * 1000, 0, 0);
}

void Heartbeat::stop()
{
	// Stop the timer (if it exists)

	if (timer != NULL) {
		DeleteTimerQueueTimer(NULL, timer, NULL);
		timer = NULL;
	}
}

Figure 5. The Heartbeat class methods and timer callback function.

The heartbeat method in the PasswordManagerCoreNative object maintains the state information. To prevent partial write corruption, it keeps a two-element array of state data and an index that identifies the current element (0 or 1). The new state information is obtained from:

  • The new ECALL ve_heartbeat in the Intel SGX code path (by way of ew_heartbeat in EnclaveBridge.cpp).
  • The Vault method heartbeat in the non-Intel SGX code path.

After the new state has been received, the method writes it into the other element of the array (alternating between elements 0 and 1), and then updates the index. The index update is the last operation and serves as our atomic update, ensuring that the state information is complete before it is officially marked as the “current” state.
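A minimal sketch of this double-buffered update pattern is shown below. The state type, buffer size, and class name are hypothetical placeholders; the point is that the index is only advanced after the inactive slot is completely written, so a power event in the middle of a write never leaves the “current” slot half-updated.

// Untrusted-side sketch of the two-element state journal. state_blob_t and
// StateJournal are hypothetical names; the buffer size is a placeholder.
#include <stdint.h>
#include <string.h>

struct state_blob_t {
	uint8_t sealed[1024];   // sealed vault state (size is a placeholder)
	uint32_t len;
};

class StateJournal {
	state_blob_t slot[2];
	uint32_t current = 0;   // index of the last fully written slot (0 or 1)
public:
	// Single-writer update: fill the inactive slot, then publish it by
	// updating the index as the final step.
	void update(const uint8_t *sealed, uint32_t len)
	{
		if (len > sizeof(slot[0].sealed)) return;   // ignore oversized updates

		uint32_t next = 1 - current;
		memcpy(slot[next].sealed, sealed, len);
		slot[next].len = len;

		current = next;   // publish: a single store is the last operation
	}

	const state_blob_t *latest() const { return &slot[current]; }
};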

Intel Software Guard Extensions code path

The ve_heartbeat ECALL simply calls the heartbeat method in the E_Vault object, as shown in Figure 6.

int E_Vault::heartbeat(char *state_data, uint32_t sz)
{
	sgx_status_t status;
	vault_state_t vault_state;
	uint64_t ts;

	// Copy the db key

	memcpy(vault_state.db_key, db_key, 16);

	// To get the system time and PID we need to make an OCALL

	status = ve_o_process_info(&ts, &vault_state.pid);
	if (status != SGX_SUCCESS) return NL_STATUS_SGXERROR;

	vault_state.lastheartbeat = (sgx_time_t)ts;

	// Storing both the start and end times provides some
	// protection against clock manipulation. It's not perfect,
	// but it's better than nothing.

	vault_state.lockafter = vault_state.lastheartbeat + lock_delay;

	// Saves us an ECALL to have to reset this when the vault is restored.

	vault_state.lock_delay = lock_delay;

	// Seal our data with the MRENCLAVE policy. We defined our
	// struct as packed to support working on the address
	// directly like this.

	status = sgx_seal_data(0, NULL, sizeof(vault_state_t), (uint8_t *)&vault_state, sz, (sgx_sealed_data_t *) state_data);
	if (status != SGX_SUCCESS) return NL_STATUS_SGXERROR;

	return NL_STATUS_OK;
}

Figure 6. The heartbeat in the enclave.

It has to obtain the current system time and the process ID, and to do this we have added our first OCALL to the enclave, ve_o_process_info. When the OCALL returns, we update our state information and then call sgx_seal_data to seal it into the state_data buffer.

One restriction of the Intel SGX seal and unseal functions is that they can only operate on enclave memory. That means the state_data parameter must be a marshaled data buffer when used in this manner. If you need to write sealed data to a raw pointer that references untrusted memory (one that is passed with the user_check parameter), you must first seal the data to an enclave-local data buffer and then copy it over.
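The sketch below illustrates that restriction. It assumes a hypothetical ECALL whose destination pointer is declared [user_check] in the EDL and reuses the vault_state_t structure from Figure 6: the data is sealed into an enclave-local buffer first, and only then copied out to the untrusted destination.

// Trusted (enclave) code sketch: sealing to a [user_check] destination.
// The function name is hypothetical.
#include <sgx_tseal.h>
#include <stdlib.h>
#include <string.h>

sgx_status_t seal_to_user_check(uint8_t *dest, uint32_t dest_sz, const vault_state_t *vs)
{
	uint32_t need = sgx_calc_sealed_data_size(0, sizeof(vault_state_t));
	if (need == UINT32_MAX || need > dest_sz) return SGX_ERROR_INVALID_PARAMETER;

	// The seal functions require the destination to be enclave memory,
	// so seal into an enclave-local buffer first.
	uint8_t *local = (uint8_t *)malloc(need);
	if (local == NULL) return SGX_ERROR_OUT_OF_MEMORY;

	sgx_status_t status = sgx_seal_data(0, NULL, sizeof(vault_state_t),
		(const uint8_t *)vs, need, (sgx_sealed_data_t *)local);
	if (status == SGX_SUCCESS)
		memcpy(dest, local, need);   // copy the sealed blob out to the untrusted destination

	free(local);
	return status;
}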

The OCALL is defined in EnclaveBridge.cpp:

// OCALL to retrieve the current process ID and
// local system time.

void SGX_CDECL ve_o_process_info(uint64_t *ts, uint64_t *pid)
{
	DWORD dwpid= GetCurrentProcessId();
	time_t ltime;

	time(&ltime);

	*ts = (uint64_t)ltime;
	*pid = (uint64_t)dwpid;
}

Because the heartbeat runs asynchronously, two threads can enter the enclave at the same time. This means the number of Thread Control Structures (TCSs) allocated to the enclave must be increased from the default of 1 to 2. This can be done one of two ways:

  1. Right-click the Enclave project, select Intel SGX Configuration -> Enclave Settings to bring up the configuration window, and then set Thread Number to 2 (see Figure 7).
  2. Edit the Enclave.config.xml file in the Enclave project directly, and then change the <TCSNum> parameter to 2.

Figure 7. Enclave settings dialog.

Detecting Suspend and Resume Events

A suspend and resume cycle will destroy the enclave, and that will be detected by the next ECALL. However, we shouldn’t rely on this mechanism to perform enclave recovery, because we need to act as soon as the system wakes up from the sleep state. That means we need an event listener to receive the power state change messages that are generated by Windows.

The best place to capture these is in the user interface layer. In addition to performing the enclave recovery, we must be able to lock the password vault if the system was in the sleep state longer than the maximum sleep time set in the user options. When the vault is locked, the user interface also needs to be updated to reflect the new vault state.

One limitation of the Windows Presentation Foundation* is that it does not provide event hooks for power-related messages. The workaround is to hook in to the message handler for the underlying window handle. Our main application window and all of our dialog windows need a listener so that we can gracefully close each one.

The hook procedure for the main window is shown in Figure 8.

private IntPtr Main_Power_Hook(IntPtr hwnd, int msg, IntPtr wParam, IntPtr lParam, ref bool handled)
{
    UInt16 pmsg;

    // C# doesn't have definitions for power messages, so we'll get them via C++/CLI. It returns a
    // simple UInt16 that defines only the things we care about.
    pmsg= PowerManagement.message(msg, wParam, lParam);

    if ( pmsg == PowerManagementMessage.Suspend )
    {
        mgr.suspend();
        handled = true;
    } else if (pmsg == PowerManagementMessage.Resume)
    {
        int vstate = mgr.resume();

        if (vstate == ResumeVaultState.Locked) lockVault();
        handled = true;
    }

    return IntPtr.Zero;
}

Figure 8. Message hook for the main window.

To get at the messages, the handler must dip down to native code. This is done using the new PowerManagement class, which defines a static function called message, shown in Figure 9. It returns one of four values:

  • PWR_MSG_NONE: The message was not a power event.
  • PWR_MSG_OTHER: The message was power-related, but not a suspend or resume message.
  • PWR_MSG_RESUME: The system has woken up from a low-power or sleep state.
  • PWR_MSG_SUSPEND: The system is suspending to a low-power state.

UINT16 PowerManagement::message(int msg, IntPtr wParam, IntPtr lParam)
{
	INT32 subcode;

	// We only care about power-related messages

	if (msg != WM_POWERBROADCAST) return PWR_MSG_NONE;

	subcode = wParam.ToInt32();

	if ( subcode == PBT_APMRESUMEAUTOMATIC ) return PWR_MSG_RESUME;
	else if (subcode == PBT_APMSUSPEND ) return PWR_MSG_SUSPEND;

	// Don't care about other power events.

	return PWR_MSG_OTHER;
}

Figure 9. The message listener.

We actually listen for both suspend and resume messages here, but the suspend handler does very little work. When a system is transitioning to a sleep state, an application has less than 2 seconds to act on the power message. All we do with the sleep message is stop the heartbeat. This isn’t strictly necessary, and is just a precaution against having a heartbeat execute while the system is suspending.

The resume message is handled by calling the resume method in PasswordManagerCore. Its job is to figure out whether the vault should be locked or unlocked. It does this by checking the current system time against the saved vault state (if any). If there’s no state, or if the system has slept longer than the maximum allowed, it returns ResumeVaultState.Locked.

Restoring the Enclave

In the Intel SGX code path, the enclave has to be recreated before the enclave state information can be checked. The code for this is shown in Figure 10.

bool PasswordManagerCore::restore_vault(bool flag_async)
{
	bool got_lock= false;
	int rv;

	// Only let one thread do the restore if both come in at the
	// same time. A spinlock approach is inefficient but simple.
	// This is OK for our application, but a high-performance
	// application (or one with a long-running work loop)
	// would want something else.

	try {
		slock.Enter(got_lock);

		if (_nlink->supports_sgx()) {
			bool do_restore = true;

			// This part is only needed for enclave-based vaults.

			if (flag_async) {
				// If we are entering as a result of a power event,
				// make sure the vault has not already been restored
				// by the synchronous/UI thread (ie, a failed ECALL).

				rv = _nlink->ping_vault();
				if (rv != NL_STATUS_LOST_ENCLAVE) do_restore = false;
				// If do_restore is false, then we'll also use the
				// last value of restore_rv as our return value.
				// This will tell us whether or not we should lock the
				// vault.
			}

			if (do_restore) {
				// If the vaultfile isn't open then we are locked or hadn't
				// been opened to begin with.

				if (!vaultfile->is_open()) {
					// Have we opened a vault yet?
					if (vaultfile->get_vault_path()->Length == 0) goto restore_error;

					// We were explicitly locked, so reopen.
					rv = vaultfile->open_read(vaultfile->get_vault_path());
					if (rv != NL_STATUS_OK) goto restore_error;
				}

				// Reinitialize the vault from the header.

				rv = _vault_reinitialize();
				if (rv != NL_STATUS_OK) goto restore_error;

				// Now, call to the native object to restore the vault state.
				rv = _nlink->restore_vault_state();
				if (rv != NL_STATUS_OK) goto restore_error;

				// The database password was restored to the vault. Now restore
				// the vault, itself.

				rv = send_vault_data();
			restore_error:
				restore_rv = (rv == NL_STATUS_OK);
			}
		}
		else {
			rv = _nlink->check_vault_state();
			restore_rv = (rv == NL_STATUS_OK);
		}

		slock.Exit(false);
	}
	catch (...) {
		// We don't need to do anything here.
	}

	return restore_rv;
}

Figure 10. The restore_vault() method.

The enclave and vault are reinitialized from the vault data file, and the vault state is restored using the method restore_vault_state in PasswordManagerCoreNative.

Which Thread Restores the Vault State?

The Tutorial Password Manager can have up to three threads executing at any given time. They are:

  • The main UI
  • The heartbeat
  • The power event handler

Only one of these threads should be responsible for actually restoring the enclave, but it is possible that both the heartbeat and the main UI thread are in the middle of an ECALL when a power event occurs. In that case, both ECALLs will fail with the error code SGX_ERROR_ENCLAVE_LOST while the power event handler is executing. Given this potential race condition, it’s necessary to decide which thread is given the job of enclave recovery.

If the lock timeout is set to zero, there won’t be a heartbeat thread at all, so it doesn’t make sense to put enclave recovery logic there. If the heartbeat ECALL returns SGX_ERROR_ENCLAVE_LOST, it simply stops the heartbeat and assumes other threads will deal with it.

That leaves the UI thread and the power event handler, and a good argument can be made that both threads need the ability to recover an enclave. The event handler will catch all suspend/resume cycles immediately, so it makes sense to have enclave recovery happen there. However, as we pointed out earlier, it is entirely possible for a power event to occur during an active ECALL on the UI thread, and there’s no reason to prevent that thread from starting the recovery, especially since it might occur before the power event message is received. This not only provides a safety net in case the event handler fails to execute for some reason, but it also provides a quick and easy retry loop for the operation.

Since we can’t have both of these threads run the recovery at the same time, we need to use locking to ensure that only the first thread to arrive is given the job. The second one simply waits for the first to finish.

It’s also possible that a failed ECALL will complete the recovery process before the event handler enters the recovery loop. To prevent the event handler from blindly repeating the enclave recovery procedure, we have added a quick test to make sure the enclave hasn’t already been recreated.

Detection in the UI Thread

The UI thread detects power events by looking for ECALLs that fail with SGX_ERROR_ENCLAVE_LOST. The wrapper functions in EnclaveBridge.cpp automatically relaunch the enclave and pass the error NL_STATUS_RECREATED_ENCLAVE back up to the PasswordManagerCore object.

Each method in PasswordManagerCore handles this return code in its own way. Some methods, such as initialize, initialize_from_header, and lock_vault, don’t actually have to restore state at all, but most of the others do, and they call in to restore_vault as shown in Figure 11.

int PasswordManagerCore::accounts_password_to_clipboard(UInt32 idx)
{
	UINT32 index = idx;
	int rv;
	int tries = 3;

	while (tries--) {
		rv = _nlink->accounts_password_to_clipboard(index);
		if (rv == NL_STATUS_RECREATED_ENCLAVE) {
			if (!restore_vault()) {
				rv = NL_STATUS_LOST_ENCLAVE;
				tries = 0;
			}
		}
		else break;
	}

	return rv;
}

Figure 11. Detecting a power event on the main UI thread.

Here, the method gets three attempts to restore the vault before giving up. This retry count of three is an arbitrary limit: it’s not likely that we’ll have multiple power events in rapid succession but it’s possible. Though we don’t want to just give up after one attempt, we also don’t want to loop forever in case there’s a system issue that prevents the enclave from ever being recreated.

Restoring and Checking State

The last step is to examine the state data for the vault and determine whether the vault should be locked or unlocked. In the Intel SGX code path, the sealed state data is sent into the enclave where it is unsealed, and then compared to current system data obtained from the OCALL ve_o_process_info. This method, restore_state, is shown in Figure 12.

int E_Vault::restore_state(char *state_data, uint32_t sz)
{
	sgx_status_t status;
	vault_state_t vault_state;
	uint64_t now, thispid;
	uint32_t szout = sz;

	// First, make an OCALL to get the current process ID and system time.
	// We make this OCALL so that the parameters aren't supplied by the
	// ECALL (which would make it trivial for the calling process to fake
	// this information)

	status = ve_o_process_info(&now, &thispid);
	if (status != SGX_SUCCESS) {
		// Zap the state data.
		memset_s(state_data, sz, 0, sz);
		return NL_STATUS_SGXERROR;
	}

	status = sgx_unseal_data((sgx_sealed_data_t *)state_data, NULL, 0, (uint8_t *)&vault_state, &szout);
	// Zap the state data.
	memset_s(state_data, sz, 0, sz);

	if (status != SGX_SUCCESS) return NL_STATUS_SGXERROR;

	if (thispid != vault_state.pid) return NL_STATUS_PERM;
	if (now < vault_state.lastheartbeat) return NL_STATUS_PERM;
	if (now > vault_state.lockafter) return NL_STATUS_PERM;

	// Everything checks out. Restore the key and mark the vault as unlocked.

	lock_delay = vault_state.lock_delay;

	memcpy(db_key, vault_state.db_key, 16);
	_VST_CLEAR(_VST_LOCKED);

	return NL_STATUS_OK;
}

Figure 12. Restoring state in the enclave.

Note that unsealing data is programmatically simpler than sealing it: the key derivation and policy information is embedded in the sealed data blob. Unlike data sealing there is only one unseal function, sgx_unseal_data, and it takes fewer parameters than its counterpart.

This method returns NL_STATUS_OK if the vault is restored to the unlocked state, and NL_STATUS_PERM if it is restored to the locked state.

Lingering Issues

The Tutorial Password Manager as currently implemented still has issues that need to be addressed.

  • There is still a race condition in the enclave recovery logic. Because the ECALL wrappers in EnclaveBridge.cpp immediately recreate the enclave before returning an error code to the PasswordManagerCore layer, it is possible for the power event handler thread to enter the restore_vault method after the enclave has been recreated but before the enclave recovery has completed. This can cause the power event handler to return the wrong status to the UI layer, placing the UI in the “locked” or “unlocked” state incorrectly.
  • We depend on the system clock when validating our state data, but the system clock is actually untrusted. A malicious user can manipulate the time in order to force the password vault into an unlocked state when the system wakes up (this can be addressed by using trusted time, instead).

Summary

In order to prevent cold boot attacks and other attacks against memory images in RAM, Intel SGX destroys the Enclave Page Cache whenever the system enters a low-power state. However, this added security comes at a price: software complexity that can’t be avoided. All real-world Intel SGX applications need to plan for power events and incorporate enclave recovery logic because failing to do so will lead to runtime errors during the application’s execution.

Power event planning can rapidly escalate the application’s level of sophistication. The user experience needs of the Tutorial Password Manager took us from a single-threaded application with relatively simple constructs to one with multiple, asynchronous threads, locking, and atomic memory updates via simple journaling. As a general rule, seamless enclave recovery requires careful design and a significant amount of added program logic.

Sample Code

The code sample for this part of the series builds against the Intel SGX SDK version 1.7 using Microsoft Visual Studio* 2015.

Release Notes

  • Running a mixed-mode Intel SGX application under the debugger in Visual Studio will cause an exception to be thrown if a power event is triggered. The exception occurs when an ECALL detects the lost enclave and returns SGX_ERROR_ENCLAVE_LOST.
  • The non-Intel SGX code path was updated to use Microsoft’s DPAPI to store the database encryption key. This is a better solution than the in-memory XOR’ing.

Coming Up Next

In Part 10 of the series, we’ll discuss debugging mixed-mode Intel SGX applications with Visual Studio. Stay tuned!

Build and Install TensorFlow* on Intel® Architecture


Introduction

TensorFlow* is a leading deep learning and machine learning framework, and as of May 2017, it now integrates optimizations for Intel® Xeon® processors and Intel® Xeon Phi™ processors. This is the first in a series of tutorials providing information for developers who want to build, install, and explore TensorFlow optimized on Intel architecture from sources available in the GitHub* repository.

Resources

The TensorFlow website is a key resource for learning about the framework, providing informative overviews, tutorials, and technical information on its various components. This is the first stop for developers interested in understanding the full extent of what TensorFlow has to offer in the area of deep learning.

The article TensorFlow Optimizations on Modern Intel® Architecture introduces the specific graph optimizations, performance experiments, and details for building and installing TensorFlow with CPU optimizations. This article is highly recommended for developers who want to understand the details of how to fully optimize TensorFlow for different topologies, and the performance improvements they can achieve in doing so.

Installation Overview

The installation steps presented in this document are distilled from information provided in the Installing TensorFlow from Sources guide on the TensorFlow website. The steps outlined below are provided to give a quick overview of the installation process; however, since third-party information is subject to change over time, it is recommended that you also review the information provided on the TensorFlow website.

The installation guidelines presented in this document focus on installing TensorFlow with CPU support only. The target operating system and Python* distribution are Ubuntu* 16.04 and Python 2.7, respectively.

Installing the Bazel* Build Tool

Bazel* is the publicly available build tool from Google*. If Bazel is already installed on your system you can skip this section. Otherwise, enter the following commands to add the Bazel distribution URI, perform the installation, and update Bazel on your system:

echo "deb [arch=amd64] http://storage.googleapis.com/bazel-apt stable jdk1.8" | sudo tee /etc/apt/sources.list.d/bazel.list
sudo apt install curl
curl https://bazel.build/bazel-release.pub.gpg | sudo apt-key add -
sudo apt-get update && sudo apt-get install bazel
sudo apt-get upgrade bazel

Installing Python* Dependencies

If the Python dependencies are already installed on your system you can skip this section. To install the required packages for Python 2.7, enter the following command:

sudo apt-get install python-numpy python-dev python-pip python-wheel

Building a TensorFlow* Pip Package for Installation

If the program Git* is not currently installed on your system, issue the following command:

sudo apt install git

Clone the GitHub repository by issuing the following command:

git clone https://github.com/tensorflow/tensorflow

The tensorflow directory created during cloning contains a script named configure that must be executed prior to creating the pip package and installing TensorFlow. This script allows you to identify the pathname, dependencies, and other build configuration options. For TensorFlow optimized on Intel architecture, this script also allows you to set up Intel® Math Kernel Library (Intel® MKL) related environment settings. Execute the following commands:

cd tensorflow
./configure

Important: Select ‘Y’ to build TensorFlow with Intel MKL support, and ‘Y’ to download MKL LIB from the web. Select the default settings for the other configuration parameters. When the script has completed running, issue the following commands to build the pip package and place it in /tmp/tensorflow_pkg:

bazel build --config=mkl --copt="-DEIGEN_USE_VML" -c opt //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

Installing TensorFlow—Native Pip Option

At this point in the process, the newly created pip package will be located in /tmp/tensorflow_pkg. The next step is to install TensorFlow, which can be done either as a native pip installation, or in an Anaconda* virtual environment as described in the next section. For a native pip installation simply enter the following command:

sudo pip install /tmp/tensorflow_pkg/tensorflow-1.2.0rc1-cp27-cp27mu-linux_x86_64.whl

(Note: The name of the wheel, as shown above in italics, may be different for your particular build.)

Once these steps have been completed be sure to validate the installation before proceeding to the next section. Note: When running the Python validation script provided in the link, be sure to change to a different directory, for example:

cd ..

Installing TensorFlow—Conda* Environment Option

Note: If you already have Anaconda installed on your system you can skip this step.

Download Anaconda from the download page and follow the directions to run the installer script. (For this tutorial, we used the 64-bit, x86, Python 2.7 version of Anaconda.) During the installation you need to agree to the license, choose the defaults, and choose 'yes' to add Anaconda to your path. Once the installation is complete, close the terminal and open a new one.

Next, we will create a conda environment and install TensorFlow from the newly created pip package located in /tmp/tensorflow_pkg. Run the following commands to create a TensorFlow environment called "inteltf" and install the wheel into it:

conda create -n inteltf
source activate inteltf
pip install /tmp/tensorflow_pkg/tensorflow-1.2.0rc1-cp27-cp27mu-linux_x86_64.whl

(Note: The name of the wheel, as shown above in italics, may be different for your particular build.)

source deactivate inteltf

Close the terminal and open a new one before proceeding.

Restart the inteltf environment and validate the TensorFlow installation by running the following Python code from the website:

source activate inteltf
python
>>> import tensorflow as tf
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
>>> print(sess.run(hello))

The Python program should output “Hello, TensorFlow!” if the installation was successful.

Coming Up

The next article in the series describes how to install TensorFlow Serving*, a high-performance serving system for machine learning models designed for production environments.

Build and Install TensorFlow* Serving on Intel® Architecture


Introduction

The first tutorial in this series, Build and Install TensorFlow* on Intel® Architecture, demonstrated how to build and install TensorFlow optimized on Intel architecture from sources available in the GitHub* repository. The information provided in this paper describes how to build and install TensorFlow* Serving, a high-performance serving system for machine learning models designed for production environments.

Installation Overview

The installation guidelines presented in this document are distilled from information available on the TensorFlow Serving GitHub website. The steps outlined below are provided to give a quick overview of the installation process; however, since third-party information is subject to change over time it is recommended that you also review the information provided on the TensorFlow Serving website.

Important: The step-by-step guidelines provided below assume the reader has already completed the tutorial Build and Install TensorFlow on Intel® Architecture, which includes the steps to install the Bazel* build tool and some of the other required dependencies not covered here.

Installing gRPC*

Begin by installing gRPC* (the grpcio Python package), Google’s open source framework for implementing remote procedure call (RPC) services.

sudo pip install grpcio

Installing Dependencies

Next, ensure the other TensorFlow Serving dependencies are installed by issuing the following command:

sudo apt-get update && sudo apt-get install -y \
build-essential \
curl \
libcurl3-dev \
git \
libfreetype6-dev \
libpng12-dev \
libzmq3-dev \
pkg-config \
python-dev \
python-numpy \
python-pip \
software-properties-common \
swig \
zip \
zlib1g-dev

Installing TensorFlow* Serving

Clone TensorFlow Serving from the GitHub repository by issuing the following command:

   git clone --recurse-submodules https://github.com/tensorflow/serving

The serving/tensorflow directory created during the cloning process contains a script named “configure” that must be executed to identify the pathname, dependencies, and other build configuration options. For TensorFlow optimized on Intel architecture, this script also allows you to set up Intel® Math Kernel Library (Intel® MKL) related environment settings. Issue the following commands:

cd serving/tensorflow
./configure

Important: Select ‘Y’ to build TensorFlow with MKL support, and ‘Y’ to download MKL LIB from the web. Select the default settings for the other configuration parameters. When the script completes, return to the serving directory and build TensorFlow Serving:

cd ..
bazel build --config=mkl --copt="-DEIGEN_USE_VML" tensorflow_serving/...

Testing the Installation

Test the TensorFlow Serving installation by issuing the following command:

bazel test tensorflow_serving/...

If everything worked OK you should see results similar to Figure 1.

Screenshot of a command prompt window with results of correct installation

Figure 1. TensorFlow Serving installation test results.

Coming Up

The next article in this series describes how to train and save a TensorFlow model, host the model in TensorFlow Serving, and use the model for inference in a client-side application.

Train and Use a TensorFlow* Model on Intel® Architecture


Introduction

TensorFlow* is a leading deep learning and machine learning framework, and as of May 2017, it now integrates optimizations for Intel® Xeon® processors and Intel® Xeon Phi™ processors. This is the third in a series of tutorials providing information for developers who want to build, install, and explore TensorFlow optimized on Intel® architecture from sources available in the GitHub* repository.

The first tutorial in this series, Build and Install TensorFlow* on Intel® Architecture, demonstrates how to build and install TensorFlow optimized on Intel architecture from sources in the GitHub* repository.

The second tutorial in the series, Build and Install TensorFlow* Serving on Intel® Architecture, describes how to build and install TensorFlow Serving, a high-performance serving system for machine learning models designed for production environments.

In this tutorial we will train and save a TensorFlow model, build a TensorFlow model server, and test the server using a client application. This tutorial is based on the MNIST for ML Beginners and Serving a TensorFlow Model tutorials on the TensorFlow website. You are encouraged to review these tutorials before proceeding to fully understand the details of how models are trained and saved.

Train and Save a MNIST Model

According to Wikipedia, the MNIST (Modified National Institute of Standards and Technology) database contains 60,000 training images and 10,000 testing images used for training and testing in the field of machine learning. Because of its relative simplicity, the MNIST database is often used as an introductory dataset for demonstrating machine learning frameworks.

To get started, open a terminal and issue the following commands:

cd ~/serving
bazel build //tensorflow_serving/example:mnist_saved_model
rm -rf /tmp/mnist_model
bazel-bin/tensorflow_serving/example/mnist_saved_model /tmp/mnist_model

Troubleshooting: At the time of this writing, the TensorFlow Serving repository identified an error logged as “NotFoundError in mnist_export example #421.” If you encounter an error after issuing the last command try this workaround:

  1. Open serving/bazel-bin/tensorflow_serving/example/mnist_saved_model.runfiles/org_tensorflow/tensorflow/contrib/image/__init__.py
  2. Comment-out (#) the following line as shown:
    #from tensorflow.contrib.image.python.ops.single_image_random_dot_stereograms import single_image_random_dot_stereograms
  3. Save and close __init__.py.
  4. Try issuing the command again:
    bazel-bin/tensorflow_serving/example/mnist_saved_model /tmp/mnist_model

Since we omitted the training_iterations and model_version command-line parameters when we ran mnist_saved_model, they defaulted to 1000 and 1, respectively. Because we passed /tmp/mnist_model for the export directory, the trained model was saved in /tmp/mnist_model/1.

As explained in the TensorFlow tutorial documentation, the “1” version sub-directory contains the following files:

  • saved_model.pb is the serialized tensorflow::SavedModel. It includes one or more graph definitions of the model, as well as metadata of the model such as signatures.
  • variables are files that hold the serialized variables of the graphs.

Troubleshooting: In some instances you might encounter an issue with the downloaded training files getting corrupted when the script runs. This error is identified as "Not a gzipped file #170" on GitHub. If necessary, these files can be downloaded manually by issuing the following commands from the /tmp directory:

wget http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz

Build and Start the TensorFlow Model Server

Build the TensorFlow model server by issuing the following command:

bazel build //tensorflow_serving/model_servers:tensorflow_model_server

Start the TensorFlow model server by issuing the following command:

bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=9000 --model_name=mnist --model_base_path=/tmp/mnist_model/ &

Test the TensorFlow Model Server

The last command started the ModelServer running in the terminal. To test the server using the mnist_client utility provided in the TensorFlow Serving installation, enter the following commands from the /serving directory:

bazel build //tensorflow_serving/example:mnist_client
bazel-bin/tensorflow_serving/example/mnist_client --num_tests=1000 --server=localhost:9000

If everything worked, you should see results similar to Figure 1.

Screenshot of a command prompt window with client test results

Figure 1. TensorFlow client test results

Troubleshooting: There is an error identified on GitHub as “gRPC doesn't respect the no_proxy environment variable” that may result in an “Endpoint read failed” error when you run the client application. Issue the env command to see if the http_proxy environment variable is set. If so, it can be temporarily unset by issuing the following command:

unset http_proxy

Summary

In this series of tutorials we explored the process of building the TensorFlow machine learning framework and TensorFlow Serving, a high-performance serving system for machine learning models, optimized for Intel architecture. A simple model based on the MNIST dataset was trained and saved, and it was then deployed using a TensorFlow model server. Lastly, the mnist_client example included in the GitHub repository was used to demonstrate how a client-side application can leverage a TensorFlow model server to do simple machine learning inference.

For additional information on this subject please visit the TensorFlow website, a key resource for learning more about the framework. The article entitled “TensorFlow Optimizations on Modern Intel Architecture” introduces the specific graph optimizations, performance experiments, and details for building and installing TensorFlow with CPU optimizations.
