Projects:2016s1-102 Classifying Internet Applications and Detecting Malicious Traffic from Network Communications

From Projects
Revision as of 21:10, 26 October 2016 by A1629981 (talk | contribs) (1. Distribution Plots)
Jump to: navigation, search

Project Team

Karl Hornlund

Jason Trann

Supervisors

Assoc Prof Cheng Chew Lim

Dr Hong Gunn Chew

Dr Adriel Cheng (DSTG)

Introduction

The project aims to use machine learning to predict the application class of computer network traffic. In particular, we will explore the usefulness of graph based techniques to extract additional features and provide a simplified model for classification; and, evaluate the classification performance with respect to identifying malicious network traffic.

Objectives

- Implement a supervised machine learning system which utilises NetFlow data and spatial traffic statistics to classify network traffic, as described by Jin et al. [12] [18] [19].

- Achieve an appropriate level of accuracy when benchmarked against previous years’ iterations of the project and verify the results of Jin et al. [18].

- Evaluate the effectiveness of using spatial traffic statistics, in particular with respect to identifying malicious traffic.

- Explore improvements and extensions on the current method prescribed by Jin et al. [12] [18] [19].

Stage 1: Bootstrap

The bootstrap stage begins by constructing the edge level features to be used as inputs to the supervised machine learning system. Edge level features are built from the flow level features of the NetFlow data, as part of the process of building the Traffic Activity Graph (TAG).

1. Constructing the Traffic Activity Graph

The TAG is constructed as follows:

(1) Map each unique host in the network to a node in the TAG.

(2) For each flow in the network, create a directed edge between the respective nodes in the TAG, corresponding to the source and destination hosts of the flow. Assign the label of that flow as an edge attribute; do the same for duration, packets, and bytes.

(3) Calculate two additional edge attributes:

mean packet size = bytes/packets

mean packet rate = packets/duration

(4)For each set of edges with both nodes in common, perform the following simplification:

Assume x edges e_1,e_2,…,e_x =(u,v).

Create a new undirected edge e_(x+1),and assign it the following attributes:

label(u,v) ∶= the most common application class label among e_1,e_2,…,e_x.

min⁡duration ≔ minimum duration among e_1,e_2,…,e_x.

min⁡packet size ∶= minimum mean packet size among e_1,e_2,…,e_x.

min⁡packet rate ∶= minimum mean packet rate among e_1,e_2,…,e_x.

max⁡duration ≔ maximum duration among e_1,e_2,…,e_x.

max⁡packet size ∶= maximum mean packet size among e_1,e_2,…,e_x.

max⁡packet rate ∶= maximum mean packet rate among e_1,e_2,…,e_x.

bytes_uv ≔ sum of bytes flowing from u to v.

bytes_vu ≔ sum of bytes flowing from v to u.

symmetry ∶= min⁡((bytes_uv)/(bytes_vu ),(bytes_vu)/(bytes_uv ))

Remove edges e_1,e_2,…,e_x from the TAG.

(5) Remove any loops (edges which leave from, and enter the same node) from the TAG.

The following set of edge level features are used for bootstrap classification:

  • Min Packet Size
  • Max Packet Size
  • Min Duration
  • Max Duration
  • Min Packet Rate
  • Max Packet Rate
  • Symmetry

2. Optimal Bootstrap Feature Selection

From the given edge level features above, to find the optimal set of features to be used within the bootstrap classifier, a brute force (trial and error) method was used. The method is outlined below:

1.Use all x available features in ML algorithm and record bootstrap results

2.Test and record all combinations of x-1 features

3.Evaluate bootstrap results from step 2 and compare which feature removal results in the largest decrease in accuracy

4.Repeat step 2 for x-2 combinations of features but retain the features that contribute the most to the final bootstrap accuracy and remove features that have little contribution to final accuracy

5.Repeat reduction in x feature combinations until accuracy results do not improve.

The resulting optimal set of edge features is:

  • Min Packet Size
  • Max Packet Size
  • Min Packet Rate
  • Max Packet Rate
  • Symmetry

Center

In the example above, from test 2-5, the accuracy shows that Feature 1 & 4 has a significant impact in the reduction of bootstrap accuracy, meaning that keeping the features are essential to a high accuracy. Thus in test 6-8, by retaining Feature 1 (which resulted in the largest decrease in accuracy), by testing different feature combinations with two features, the results verify that Features 1 & 4 are the most important features to generate the highest bootstrap accuracy.

3. Tree Based Classifier

The project will used a Random Forest Classifier in R, such that results can be directly compared to those form the 2014 project.

There are two main parameters to optimise the Random Forest algorithm:

ntree ≔ Number of decision trees to grow

nodesize ≔ Size of terminal node or the depth of the tree

The optimal nodesize was found by an iterative brute-force method.

Assume N features to be used.

Try nodesize = √N.

Then try nodesize = 2k√N, and nodesize = √N/(2k), k=1,2,3…

The value which led to the highest classification accuracy in the tests was chosen as the optimal parameter value.

To settle on a value for ntree, again an iterative process was used. By starting with an arbitrarily small value, and observing the out-of-bag error rate (OOB) as ntree is increased, a value for ntree was chosen as the point where the OOB plateaus. (Note that the out of bag error rate refers to the prediction error of ML algorithm.)

The parameter values to be used for the bootstrap classifier are as follows:

ntree = 20

nodesize = 10

Stage 2: Calibration

In the second stage of the system, a graph-based approach to extract spatial traffic statistics from the initial application label predictions will be used. These spatial traffic statistics will then be used as features in a second iteration of ML classification, effectively calibrating the initial classification.

Overview System.png


1. Creating Vertex Histograms

In this step spatial traffic statistics are extracted, and the information stored in a histogram as a node attribute.

Assume N application classes. Number each class 1 to N.

Let deg_k(v) be the degree of a node v only counting edges with application class k, k ∈ [1,N].

Let deg(v) be the total degree of a node v.

Let deg ̂_k(v) be the normalised degree, deg ̂_k(v) = deg_k (v)/deg(v)

For each node v in the TAG:

Create a histogram h with N elements,such that h[k] = deg ̂_k (v)

Assign this histogram as a node attribute to v

2. Creating Spatial Traffic Statistics Features

The spatial traffic statistics to be used as features in the calibration stage are extracted from the vertex histograms as follows:

Assume N application classes.

For each edge e = (u,v) in the TAG:

Let h_u be the histogram of vertex u, h_v the histogram of vertex v

Assign edge attributes x_1, x_2,…, x_n, x_deg, y_1, y_2,…, y_n, y_deg to e such that:

x_k = h_u[k], y_k = h_v[k], k ∈ [1,N]

x_deg = deg(u), y_deg = deg(v)

For N application classes, this will result in 2*(N+1) features to be used for calibration classification.

3. Performing Calibration

A Random Forest classifier will also be used in the calibration stage. Using the same process as described in section 4.2.3 the following parameters are to be used for the calibration classifier:

ntree = 10

nodesize = 20

The output will be a set of calibrated edge application labels, which are a final estimate as to the application class responsible for the majority of the network traffic between two hosts in the network.

4. Mapping Edge Predictions to Flow Predictions

As the system classifies edges, it is of interest to know how accurately these predictions classify flows. The following 1..N mapping from edge labels to flow labels is used to determine this:

For each edge e = (u,v) in the TAG:

Let L_edge(e) be the application class prediction for edge e.

Let L_flow(f) be the application class prediction for flow f.

Assign L_flow(f) ← L_edge(e) ∀ f = (w,z) |[w∈{u,v} ⋀ z∈{u,v}]

Preliminary Investigation

This section aims to answer the question ‘how accurately can a single application class label represent the traffic between two hosts in a network?’.

1. Distribution Plots

In order to represent the distribution plots of each dataset, several histograms are created. The histograms illustrate the majority (dominant) class type per edge for a particular dataset capture. The measures of the histogram are as follows:

- Independent variable is the percentage dominance of a class per edge.

- Dependent variable is the measure (occurrences) of a particular percentage dominant class type per edge.

These histograms are calculated by using the IP addresses from the NetFlow data for each dataset and also use the ground truth label for each individual flow. These histograms demonstrate the majority confidence for each edge as a result.

UPC I edge labelp2p.png

UPC I edge.png

By observing the log scale graphs, there is an obvious concentration of flows in the 100% bin. The peak in the 50% is likely due to the fact that two application classes, which are both a 50% majority, could both be considered the ‘dominant’ class. This sparked interest into how often a 50/50 split among application classes would occur, which led to the following investigation into how many edges and flows belonged to this 50% majority category.

50% table.png

The table above represents the number of occurrences where there is a 50/50 ratio of classes per edge/flow. The 50/50 ratio per edge represents the number of instances where an edge consists of two application classes with the same number of flows along the edge, while the 50/50 ratio for flows is the total number of flows that occur within all 50% majority edges. Also included is the total number of edges and flows per capture. The number of instances of a 50% majority is then calculated as a percentage of total edges/flows.

The results from the graphs above, show that at least 90%+ of flows with an edge consists of 1 dominant class type. In particular, for 50% dominant instances per edge and per flow, instances are low for each dataset (1.45% of all edges and 1.78% for all flows).

2. Confusion Matrix

A second test supported the equivalency between flow labels and edge labels; it determines an upper bound for the accuracy achievable mapping edge labels to flow labels by emulating the scenario where the calibration stage produces an optimal set of edge labels.

An optimal set of edge labels was constructed as follows: for each edge, select the most commonly occurring ground truth label among the relevant flows as the edge label. Then, map the set of optimal edge labels back to the relevant flows to create a set of derived flow labels. By comparing the original ground truth labels to the derived flow labels in a confusion matrix, one can determine how often flows do not belong to the majority label for that edge.

Confusion Matrix.png

These results support the statement that: by weighted average, 98.38% of flows between a given two hosts are of the same application class. The implication of this, is that there is on average an upper bound of 98.38% regarding the accuracy achievable using this method. Given that this project is benchmarked against results of approximately 92% accuracy [4] [19], the preliminary investigation suggests that it is possible to surpass this benchmark with the proposed method.