Difference between revisions of "Projects:2016s1-102 Classifying Internet Applications and Detecting Malicious Traffic from Network Communications"

Revision as of 19:15, 26 October 2016

Project Team

Karl Hornlund

Jason Trann

Supervisors

Assoc Prof Cheng Chew Lim

Dr Hong Gunn Chew

Dr Adriel Cheng (DSTG)

Introduction

The project aims to use machine learning to predict the application class of computer network traffic. In particular, we explore the usefulness of graph-based techniques for extracting additional features and providing a simplified model for classification, and we evaluate classification performance with respect to identifying malicious network traffic.

Objectives

- Implement a supervised machine learning system which utilises NetFlow data and spatial traffic statistics to classify network traffic, as described by Jin et al. [12] [18] [19].

- Achieve an appropriate level of accuracy when benchmarked against previous years’ iterations of the project and verify the results of Jin et al. [18].

- Evaluate the effectiveness of using spatial traffic statistics, in particular with respect to identifying malicious traffic.

- Explore improvements and extensions on the current method prescribed by Jin et al. [12] [18] [19].

Stage 1: Bootstrap

The bootstrap stage begins by constructing the edge-level features used as inputs to the supervised machine learning system. Edge-level features are built from the flow-level features of the NetFlow data as part of building the Traffic Activity Graph (TAG).

Constructing the Traffic Activity Graph

The TAG is constructed as follows:

(1) Map each unique host in the network to a node in the TAG.

(2) For each flow in the network, create a directed edge between the respective nodes in the TAG, corresponding to the source and destination hosts of the flow. Assign the label of that flow as an edge attribute; do the same for duration, packets, and bytes.

(3) Calculate two additional edge attributes:

mean packet size = bytes/packets

mean packet rate = packets/duration

(4) For each set of edges that share the same pair of nodes, in either direction, perform the following simplification:

Suppose x edges e_1, e_2, …, e_x join nodes u and v.

Create a new undirected edge e_(x+1), and assign it the following attributes:

label(u,v) := the most common application class label among e_1, e_2, …, e_x.

min duration := minimum duration among e_1, e_2, …, e_x.

min packet size := minimum mean packet size among e_1, e_2, …, e_x.

min packet rate := minimum mean packet rate among e_1, e_2, …, e_x.

max duration := maximum duration among e_1, e_2, …, e_x.

max packet size := maximum mean packet size among e_1, e_2, …, e_x.

max packet rate := maximum mean packet rate among e_1, e_2, …, e_x.

bytes_uv := sum of bytes flowing from u to v.

bytes_vu := sum of bytes flowing from v to u.

symmetry := min(bytes_uv / bytes_vu, bytes_vu / bytes_uv)

Remove edges e_1, e_2, …, e_x from the TAG.

(5) Remove any loops (edges that leave and enter the same node) from the TAG.
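
The five steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's implementation: the names Flow, ratio and build_tag and the field names are hypothetical, and returning 0.0 when a denominator is zero (for example a zero-duration flow, or no bytes in one direction) is a choice made here that the steps above do not specify. Loops are dropped before grouping, which gives the same result as removing them in step (5).

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class Flow:
    """One NetFlow record (hypothetical field names)."""
    src: str
    dst: str
    label: str
    duration: float
    packets: int
    byte_count: int


def ratio(numerator, denominator):
    """Divide, returning 0.0 for a zero denominator (an assumption made here)."""
    return numerator / denominator if denominator else 0.0


def build_tag(flows):
    """Return {(u, v): attributes} with one undirected edge per host pair."""
    groups = defaultdict(list)
    for flow in flows:
        if flow.src == flow.dst:
            continue  # step 5: loops never enter the graph
        # Steps 1, 2 and 4: hosts act as nodes, and flows are grouped by
        # their unordered pair of endpoints.
        groups[tuple(sorted((flow.src, flow.dst)))].append(flow)

    tag = {}
    for (u, v), edges in groups.items():
        durations = [f.duration for f in edges]
        sizes = [ratio(f.byte_count, f.packets) for f in edges]  # step 3
        rates = [ratio(f.packets, f.duration) for f in edges]    # step 3
        bytes_uv = sum(f.byte_count for f in edges if f.src == u)
        bytes_vu = sum(f.byte_count for f in edges if f.src == v)
        tag[(u, v)] = {
            "label": Counter(f.label for f in edges).most_common(1)[0][0],
            "min_duration": min(durations),
            "max_duration": max(durations),
            "min_packet_size": min(sizes),
            "max_packet_size": max(sizes),
            "min_packet_rate": min(rates),
            "max_packet_rate": max(rates),
            "bytes_uv": bytes_uv,
            "bytes_vu": bytes_vu,
            "symmetry": min(ratio(bytes_uv, bytes_vu),
                            ratio(bytes_vu, bytes_uv)),
        }
    return tag
```

Keying each edge by the sorted pair of hosts makes the merged edge undirected while still letting bytes_uv and bytes_vu record the direction of the original flows.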