Difference between revisions of "Projects:2016s1-102 Classifying Internet Applications and Detecting Malicious Traffic from Network Communications"

From Projects
Jump to: navigation, search
(2. Optimal Bootstrap Feature Selection)
(2. Optimal Bootstrap Feature Selection)
Line 101: Line 101:
  
 
5.Repeat reduction in x feature combinations until accuracy results do not improve.
 
5.Repeat reduction in x feature combinations until accuracy results do not improve.
 
[[File:Example_table.png|thumb|Figure 1: Example Table]]
 
  
 
The resulting optimal set of edge features is:
 
The resulting optimal set of edge features is:
Line 111: Line 109:
 
* Max Packet Rate
 
* Max Packet Rate
 
* Symmetry
 
* Symmetry
 +
[[File:Example_table.png|thumb|Figure 1: Example Table]]
 +
 +
In Figure 1, from test 2-5, the accuracy shows that Feature 1 & 4 has a significant impact in the reduction of bootstrap accuracy, meaning that keeping the features are essential to a high accuracy. Thus in test 6-8, by retaining Feature 1 (which resulted in the largest decrease in accuracy), by testing different feature combinations with two features, the results verify that Features 1 & 4 are the most important features to generate the highest bootstrap accuracy.
  
 
== 3. Tree Based Classifier ==
 
== 3. Tree Based Classifier ==

Revision as of 19:38, 26 October 2016

Project Team

Karl Hornlund

Jason Trann

Supervisors

Assoc Prof Cheng Chew Lim

Dr Hong Gunn Chew

Dr Adriel Cheng (DSTG)

Introduction

The project aims to use machine learning to predict the application class of computer network traffic. In particular, we will explore the usefulness of graph based techniques to extract additional features and provide a simplified model for classification; and, evaluate the classification performance with respect to identifying malicious network traffic.

Objectives

- Implement a supervised machine learning system which utilises NetFlow data and spatial traffic statistics to classify network traffic, as described by Jin et al. [12] [18] [19].

- Achieve an appropriate level of accuracy when benchmarked against previous years’ iterations of the project and verify the results of Jin et al. [18].

- Evaluate the effectiveness of using spatial traffic statistics, in particular with respect to identifying malicious traffic.

- Explore improvements and extensions on the current method prescribed by Jin et al. [12] [18] [19].

Stage 1: Bootstrap

The bootstrap stage begins by constructing the edge level features to be used as inputs to the supervised machine learning system. Edge level features are built from the flow level features of the NetFlow data, as part of the process of building the Traffic Activity Graph (TAG).

1. Constructing the Traffic Activity Graph

The TAG is constructed as follows:

(1) Map each unique host in the network to a node in the TAG.

(2) For each flow in the network, create a directed edge between the respective nodes in the TAG, corresponding to the source and destination hosts of the flow. Assign the label of that flow as an edge attribute; do the same for duration, packets, and bytes.

(3) Calculate two additional edge attributes:

mean packet size = bytes/packets

mean packet rate = packets/duration

(4)For each set of edges with both nodes in common, perform the following simplification:

Assume x edges e_1,e_2,…,e_x =(u,v).

Create a new undirected edge e_(x+1),and assign it the following attributes:

label(u,v) ∶= the most common application class label among e_1,e_2,…,e_x.

min⁡duration ≔ minimum duration among e_1,e_2,…,e_x.

min⁡packet size ∶= minimum mean packet size among e_1,e_2,…,e_x.

min⁡packet rate ∶= minimum mean packet rate among e_1,e_2,…,e_x.

max⁡duration ≔ maximum duration among e_1,e_2,…,e_x.

max⁡packet size ∶= maximum mean packet size among e_1,e_2,…,e_x.

max⁡packet rate ∶= maximum mean packet rate among e_1,e_2,…,e_x.

bytes_uv ≔ sum of bytes flowing from u to v.

bytes_vu ≔ sum of bytes flowing from v to u.

symmetry ∶= min⁡((bytes_uv)/(bytes_vu ),(bytes_vu)/(bytes_uv ))

Remove edges e_1,e_2,…,e_x from the TAG.

(5) Remove any loops (edges which leave from, and enter the same node) from the TAG.

The following set of edge level features are used for bootstrap classification:

  • Min Packet Size
  • Max Packet Size
  • Min Duration
  • Max Duration
  • Min Packet Rate
  • Max Packet Rate
  • Symmetry

2. Optimal Bootstrap Feature Selection

From the given edge level features above, to find the optimal set of features to be used within the bootstrap classifier, a brute force (trial and error) method was used. The method is outlined below:

1.Use all x available features in ML algorithm and record bootstrap results

2.Test and record all combinations of x-1 features

3.Evaluate bootstrap results from step 2 and compare which feature removal results in the largest decrease in accuracy

4.Repeat step 2 for x-2 combinations of features but retain the features that contribute the most to the final bootstrap accuracy and remove features that have little contribution to final accuracy

5.Repeat reduction in x feature combinations until accuracy results do not improve.

The resulting optimal set of edge features is:

  • Min Packet Size
  • Max Packet Size
  • Min Packet Rate
  • Max Packet Rate
  • Symmetry
Figure 1: Example Table

In Figure 1, from test 2-5, the accuracy shows that Feature 1 & 4 has a significant impact in the reduction of bootstrap accuracy, meaning that keeping the features are essential to a high accuracy. Thus in test 6-8, by retaining Feature 1 (which resulted in the largest decrease in accuracy), by testing different feature combinations with two features, the results verify that Features 1 & 4 are the most important features to generate the highest bootstrap accuracy.

3. Tree Based Classifier