Projects:2016s1-102 Classifying Internet Applications and Detecting Malicious Traffic from Network Communications
Revision as of 19:23, 26 October 2016
Project Team
Karl Hornlund
Jason Trann
Supervisors
Assoc Prof Cheng Chew Lim
Dr Hong Gunn Chew
Dr Adriel Cheng (DSTG)
Introduction
The project aims to use machine learning to predict the application class of computer network traffic. In particular, we will explore the usefulness of graph-based techniques for extracting additional features and providing a simplified model for classification, and we will evaluate the classification performance with respect to identifying malicious network traffic.
Objectives
- Implement a supervised machine learning system which utilises NetFlow data and spatial traffic statistics to classify network traffic, as described by Jin et al. [12] [18] [19].
- Achieve an appropriate level of accuracy when benchmarked against previous years’ iterations of the project and verify the results of Jin et al. [18].
- Evaluate the effectiveness of using spatial traffic statistics, in particular with respect to identifying malicious traffic.
- Explore improvements and extensions on the current method prescribed by Jin et al. [12] [18] [19].
Stage 1: Bootstrap
The bootstrap stage begins by constructing the edge-level features to be used as inputs to the supervised machine learning system. Edge-level features are built from the flow-level features of the NetFlow data, as part of the process of building the Traffic Activity Graph (TAG).
1. Constructing the Traffic Activity Graph
The TAG is constructed as follows:
(1) Map each unique host in the network to a node in the TAG.
(2) For each flow in the network, create a directed edge between the respective nodes in the TAG, corresponding to the source and destination hosts of the flow. Assign the label of that flow as an edge attribute; do the same for duration, packets, and bytes.
(3) Calculate two additional edge attributes:
mean packet size = bytes/packets
mean packet rate = packets/duration
(4) For each set of edges with both nodes in common, perform the following simplification:
Assume x edges e_1, e_2, …, e_x = (u, v).
Create a new undirected edge e_(x+1), and assign it the following attributes:
label(u, v) := the most common application class label among e_1, e_2, …, e_x.
min duration := minimum duration among e_1, e_2, …, e_x.
min packet size := minimum mean packet size among e_1, e_2, …, e_x.
min packet rate := minimum mean packet rate among e_1, e_2, …, e_x.
max duration := maximum duration among e_1, e_2, …, e_x.
max packet size := maximum mean packet size among e_1, e_2, …, e_x.
max packet rate := maximum mean packet rate among e_1, e_2, …, e_x.
bytes_uv := sum of bytes flowing from u to v.
bytes_vu := sum of bytes flowing from v to u.
symmetry := min(bytes_uv / bytes_vu, bytes_vu / bytes_uv)
Remove edges e_1, e_2, …, e_x from the TAG.
(5) Remove any loops (edges which leave from, and enter the same node) from the TAG.
The following edge-level features are used for bootstrap classification:
- Min Packet Size
- Max Packet Size
- Min Duration
- Max Duration
- Min Packet Rate
- Max Packet Rate
- Symmetry
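The TAG construction in steps (1) to (5) and the seven edge-level features listed above can be sketched in Python as follows. This is a minimal illustration, not the project's implementation. The function name build_tag and the flow record keys (src, dst, label, duration, packets, bytes) are hypothetical. Two behaviours are assumptions that the description above does not specify: flows with zero packets or zero duration are skipped because their means are undefined, and symmetry is set to 0 when traffic flows in only one direction. Loops are dropped as the flows are read rather than at the end, which gives the same graph.

```python
from collections import Counter, defaultdict


def build_tag(flows):
    """Return (nodes, edges) for a simplified, undirected TAG.

    `flows` is an iterable of dicts with the hypothetical keys
    src, dst, label, duration, packets, bytes.
    """
    nodes = set()
    groups = defaultdict(list)  # unordered host pair -> flow records
    for f in flows:
        # Step (1): every unique host becomes a node.
        nodes.update((f["src"], f["dst"]))
        # Step (5), applied early: a loop never joins two distinct nodes.
        if f["src"] == f["dst"]:
            continue
        # Assumption: skip flows whose mean size or rate is undefined.
        if f["packets"] == 0 or f["duration"] == 0:
            continue
        # Steps (2)-(3): keep the flow attributes and add the two means.
        rec = dict(f)
        rec["size"] = f["bytes"] / f["packets"]
        rec["rate"] = f["packets"] / f["duration"]
        groups[frozenset((f["src"], f["dst"]))].append(rec)

    edges = {}
    for pair, recs in groups.items():
        # Step (4): merge all flows between u and v into one undirected edge.
        u, v = sorted(pair)
        bytes_uv = sum(r["bytes"] for r in recs if r["src"] == u)
        bytes_vu = sum(r["bytes"] for r in recs if r["src"] == v)
        if bytes_uv == 0 or bytes_vu == 0:
            symmetry = 0.0  # assumption: one-way traffic is fully asymmetric
        else:
            symmetry = min(bytes_uv / bytes_vu, bytes_vu / bytes_uv)
        edges[(u, v)] = {
            "label": Counter(r["label"] for r in recs).most_common(1)[0][0],
            "min_packet_size": min(r["size"] for r in recs),
            "max_packet_size": max(r["size"] for r in recs),
            "min_duration": min(r["duration"] for r in recs),
            "max_duration": max(r["duration"] for r in recs),
            "min_packet_rate": min(r["rate"] for r in recs),
            "max_packet_rate": max(r["rate"] for r in recs),
            "symmetry": symmetry,
        }
    return nodes, edges
```

Each value in edges is one training example for the bootstrap classifier: the label is the target and the remaining seven entries form the feature vector.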
2. Optimal Bootstrap Feature Selection
To find the optimal subset of the edge-level features above for use in the bootstrap classifier, a brute-force (trial and error) method was used. The method is outlined below:
1. Use all x available features in the ML algorithm and record the bootstrap results.
2. Test and record all combinations of x-1 features.
3. Evaluate the bootstrap results from step 2 and identify which feature removal results in the largest decrease in accuracy.
4. Repeat step 2 for combinations of x-2 features, retaining the features that contribute most to the final bootstrap accuracy and removing features that contribute little.
5. Continue reducing the number of features in this way until the accuracy results no longer improve.
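One way to read the procedure above is as a greedy backward elimination: start from all features, try every subset with one feature removed, keep the best-scoring subset, and stop when no removal improves accuracy. The sketch below follows that reading and is an interpretation, not the project's code. The function backward_eliminate and its evaluate callback are hypothetical; evaluate is assumed to train and test the bootstrap classifier on the given feature subset and return its accuracy.

```python
def backward_eliminate(features, evaluate):
    """Greedy backward elimination over a list of feature names.

    `evaluate(subset)` is a caller-supplied (hypothetical) function that
    returns the classifier accuracy for a tuple of feature names.
    Returns (selected_features, best_score, history).
    """
    current = tuple(features)
    best = evaluate(current)            # step 1: score all x features
    history = [(current, best)]
    while len(current) > 1:
        # Steps 2 and 4: score every subset with exactly one feature removed.
        candidates = [tuple(f for f in current if f != drop) for drop in current]
        scored = [(evaluate(c), c) for c in candidates]
        # Step 3: the best candidate drops the least useful feature.
        score, subset = max(scored, key=lambda item: item[0])
        # Step 5: stop once removing a feature no longer improves accuracy.
        if score <= best:
            break
        current, best = subset, score
        history.append((current, best))
    return current, best, history
```

With x features this evaluates at most x + (x-1) + … + 2 subsets after the initial run, far fewer than the 2^x - 1 subsets of an exhaustive search, at the cost of possibly missing a combination that a full search would find.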