Projects:2017s1-101 Classifying Network Traffic Flows with Deep-Learning
Project Team
Clinton Page
Daniel Smit
Kyle Millar
Supervisors
Dr Cheng-Chew Lim
Dr Hong-Gunn Chew
Dr Adriel Cheng (DST Group)
Introduction
The internet has become a key facilitator of large-scale global communications and is vital in providing an immeasurable number of services every day. With the ever-expanding growth of internet use, it is critical to effectively manage the underlying networks that hold it together. Network traffic classification plays a crucial role in this management, providing quality of service, forecasting future trends, and detecting potential security threats. For these reasons, accurate network traffic classification is of great interest to internet service providers (ISPs), large-scale enterprise companies, and government agencies alike.
Current methods of network traffic classification have become less effective in recent years due to the increasing trend of obscuring network activity, whether it be for security, priority, or malicious intent [1, 2, 3]. A more effective classification algorithm is therefore needed to handle these conditions in today’s networks.
Objectives
- Gain knowledge about the application of deep-learning for classifying network traffic flows
- Conduct experiments on synthetic traffic flows and/or make use of communications flow data from real-life enterprise networks
- Develop network traffic classification software using deep-learning techniques that achieves an acceptable accuracy when compared against the results of previous years
Relevant Work
Extensive research has been performed on network traffic classification. Common techniques include port-based classification and deep packet inspection (DPI). Port-based classification performs poorly due to the usage of non-standard port numbers, and DPI requires updates to recognise unseen classes of network traffic. Moore and Papagiannaki [4] found that using port-based classification as the sole classifier resulted in classification accuracies of less than 70%. nDPI, an open-source tool for investigating traffic using DPI, shows that a high classification accuracy (around 99%) can be achieved for standard applications, but that accuracy deteriorates depending on how common the application is and whether its traffic has been encrypted [5].
Machine learning techniques have therefore been researched extensively as a way to tackle the problems with these traditional methods. Previous iterations of this project have investigated various machine learning techniques to perform network traffic classification. In 2014, tree-based and support vector machine (SVM) algorithms were investigated. Using the Universitat Politècnica de Catalunya (UPC) data set [6], the team was able to distinguish botnet traffic from legitimate traffic with up to 94% classification accuracy using decision trees and 89% classification accuracy using SVM techniques [7]. In 2016, 10 graph-based methods that utilised spatial traffic statistics were explored, achieving classification accuracies of up to 95% on the same UPC data set [8].
Auld et al. [3] investigated the use of statistical data (e.g., number of packets per flow) as inputs to a neural network. The study used Bayesian neural networks and created a model using 246 selected flow-level features. A list of the most valuable features, ranked by their weightings in the neural network, was included in the report. With this method, an accuracy of 95.8% was achieved.
Trivedi et al. [9] found similar results utilising a neural network to classify network traffic based upon the variance of packet lengths. Comparing the neural network against a clustering approach, it was found that the neural network both achieved a better classification and took less time to train.
As deep learning is a relatively recent addition to the field of machine learning, only a few papers have considered its use in network traffic classification. Wang [1] showed that the first one thousand bytes of a network flow’s payload could prove an effective input for a deep (multilayered) neural network. Although the paper lacked detail about its implementation and the data set used, its results were promising for the utilisation of deep learning in network traffic classification.
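To make the payload-as-input idea concrete, the sketch below turns a flow's payload into a fixed-length numeric vector that a neural network could accept. It is only an illustration: the function name `payload_to_input`, the zero-padding of short payloads, and the scaling of each byte to the range 0 to 1 are assumptions made here, not details taken from [1].

```python
INPUT_LENGTH = 1000  # number of payload bytes used as input, as described for [1]


def payload_to_input(payload, length=INPUT_LENGTH):
    """Convert raw payload bytes into a fixed-length list of floats in [0, 1].

    Assumptions (not from the cited paper): payloads longer than `length`
    are truncated, shorter ones are padded with zero bytes, and each byte
    is divided by 255 so that every input lies between 0 and 1.
    """
    truncated = payload[:length]
    padded = truncated + bytes(length - len(truncated))  # zero-filled padding
    return [byte / 255.0 for byte in padded]


# A short, made-up payload: the start of an HTTP request.
vector = payload_to_input(b"GET / HTTP/1.1\r\n")
print(len(vector))
```

Each flow then yields a vector of the same length regardless of how much data it carried, which is what a network with a fixed number of input nodes requires.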
Background
Network Flows
The term ‘flow’ can be thought of as a conversation between two end points on a network. These two end points exchange packets with each other until the conversation ends. Packets consist of two sections: the header, which holds information about the packet (e.g., destination and source address), and the payload, which holds the message to be delivered to the recipient. An individual flow has been defined as the unidirectional or bidirectional exchange of packets sharing the same five key properties [10]:
- Destination IP address
- Source IP address
- Destination port address
- Source port address
- Transport layer protocol (TCP or UDP).
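The five properties above can be used as a key for grouping captured packets into flows. The following is a minimal sketch, assuming a simplified `Packet` record and made-up addresses; the names `Packet`, `flow_key`, and `group_into_flows` are hypothetical and are not part of the project's software. For the bidirectional case, the two end points are placed in a fixed order so that both directions of a conversation map to the same key.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical, simplified packet record used only for this example."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str  # "TCP" or "UDP"
    length: int    # payload size in bytes


def flow_key(pkt, bidirectional=True):
    """Return the five-tuple identifying the flow that a packet belongs to."""
    first = (pkt.src_ip, pkt.src_port)
    second = (pkt.dst_ip, pkt.dst_port)
    if bidirectional:
        # Sort the end points so A->B and B->A produce the same key.
        first, second = sorted((first, second))
    return (first[0], first[1], second[0], second[1], pkt.protocol)


def group_into_flows(packets, bidirectional=True):
    """Group packets into flows keyed by their five-tuple."""
    flows = defaultdict(list)
    for pkt in packets:
        flows[flow_key(pkt, bidirectional)].append(pkt)
    return dict(flows)


# Made-up traffic: a TCP request and its reply, plus one unrelated UDP packet.
packets = [
    Packet("10.0.0.1", 50000, "10.0.0.2", 80, "TCP", 120),
    Packet("10.0.0.2", 80, "10.0.0.1", 50000, "TCP", 1400),
    Packet("10.0.0.1", 50001, "10.0.0.2", 53, "UDP", 60),
]
flows = group_into_flows(packets)
print(len(flows))
```

With `bidirectional=False`, the request and the reply are treated as two separate unidirectional flows.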
Machine Learning and Deep Learning
Machine learning (ML) is a subset of artificial intelligence (AI) that uses pattern recognition to classify or make predictions from a given set of data. This project focuses on one area of ML in particular: supervised learning. A supervised learning algorithm uses the desired output (in the case of this report, the correct classification of network traffic) to assess the performance of a model and make corrections based upon the difference between its prediction and the correct classification [11].
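The correction step described above can be illustrated with a single-node classifier trained by the perceptron rule, chosen here purely for brevity; it is not the model used in this project. The two features per flow and the class labels are made-up values, and the names `predict` and `train` are hypothetical.

```python
def predict(weights, bias, features):
    """Predict class 1 if the weighted sum of the features is positive, else 0."""
    total = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if total > 0 else 0


def train(samples, labels, epochs=20, learning_rate=0.1):
    """Adjust the weights using the error between prediction and correct label."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            # error is 0 when correct, +1 or -1 when the prediction is wrong.
            error = label - predict(weights, bias, features)
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias


# Made-up, pre-scaled flow features with known (supervised) class labels.
samples = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
labels = [1, 1, 0, 0]

weights, bias = train(samples, labels)
predictions = [predict(weights, bias, features) for features in samples]
print(predictions)
```

When a prediction matches its label the error is zero and nothing changes; a wrong prediction moves the weights in the direction that would have reduced the error.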
Artificial neural networks (ANNs) are a subset of machine learning based on designing mathematical models that mimic how biological neural networks compute information. It is customary to drop “artificial” when discussing ANNs in the context of ML; this convention is adopted for the rest of this paper, with NN denoting a neural network. There are many different types of NN, but the foundation of each is a network of simple processors, referred to as nodes or neurons, which communicate through numerous connections to other nodes within the network [18] (refer to Figure 2). NNs are structured in layers, and it is these layers that distinguish deep neural networks from other NNs. A typical NN is divided into three layers: the input, hidden, and output layers. When an NN is made up of multiple hidden layers, it is said to be a deep neural network (DNN). Deep learning is the process of using these deep neural networks for machine learning.
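The layered structure can be sketched in a few lines of plain Python. The example below builds a small network with two hidden layers (which makes it "deep" in the sense used here) and passes one input through it. The layer sizes, the random untrained weights, the sigmoid activation, and the function names are all assumptions chosen for illustration, not the architecture used in this project.

```python
import math
import random


def make_layer(n_inputs, n_nodes, rng):
    """One fully connected layer: a (weights, bias) pair for each node."""
    return [([rng.uniform(-1, 1) for _ in range(n_inputs)], 0.0)
            for _ in range(n_nodes)]


def forward_layer(layer, inputs):
    """Each node sums its weighted inputs and applies a sigmoid activation."""
    outputs = []
    for weights, bias in layer:
        total = bias + sum(w * x for w, x in zip(weights, inputs))
        outputs.append(1.0 / (1.0 + math.exp(-total)))
    return outputs


def build_network(layer_sizes, seed=0):
    """Create the hidden and output layers; the input layer holds no weights."""
    rng = random.Random(seed)
    return [make_layer(n_in, n_out, rng)
            for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]


def forward(network, inputs):
    """Pass the inputs through every layer in turn."""
    activations = inputs
    for layer in network:
        activations = forward_layer(layer, activations)
    return activations


# 4 input features -> two hidden layers of 5 nodes each -> 3 output nodes.
network = build_network([4, 5, 5, 3])
outputs = forward(network, [0.2, 0.7, 0.1, 0.9])
print(len(outputs))
```

Removing one of the two hidden layers from `layer_sizes` gives the typical three-layer NN described above; adding more hidden layers makes the network deeper without changing any other code.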
References
[1] Z. Wang, "The Applications of Deep Learning on Traffic Identification," Black Hat USA, 2015.
[2] S. Zeba and D.G. Harkut, "An overview of network traffic classification methods," International Journal on Recent and Innovation Trends in Computing and Communication (IJRITCC), no. ISSN: 2321-8169, pp. 482 - 488, February 2015.
[3] T. Auld, A. W. Moore, and S. F. Gull, "Bayesian neural networks for internet traffic classification," IEEE Transactions on Neural Networks, vol. 18, no. 1, pp. 223-39, Jan 2007.
[4] A. W. Moore and K. Papagiannaki, "Toward the Accurate Identification of Network Applications," in PAM, 2005, vol. 5, pp. 41-54: Springer. An intensive examination of network traffic and methods to classify them.
[5] L. Deri, M. Martinelli, T. Bujlow, and A. Cardigliano, "nDPI: Open-source high-speed deep packet inspection," in 2014 International Wireless Communications and Mobile Computing Conference (IWCMC), 2014, pp. 617-622. A deep packet inspection tool utilised to add additional labels to the UNSW-NB15 data set.
[6] V. Carela-Español, P. Barlet-Ros, A. Cabellos-Aparicio, and J. Solé-Pareta, "Analysis of the impact of sampling on NetFlow traffic classification," Computer Networks, vol. 55, no. 5, pp. 1083-1099, 2011. The “UPC” data set, used by both the 2014 and 2016 iterations of this project.
[7] B. McAleer et al., "Honours Project 10: Development of Machine Learning Techniques for Analysing Network Communications," The University of Adelaide, Adelaide 2014. 2014’s iteration of the project. Investigated network traffic with tree based and SVM classifiers.
[8] K. Hörnlund, J. Trann, H. G. Chew, C. C. Lim, and A. Cheng, "Classifying Internet Applications and Detecting Malicious Traffic from Network Communications," ECMS, The University of Adelaide, 2016. Last year’s iteration of the project. Explored network traffic classification utilising graph-based techniques.
[9] C. Trivedi, M.-Y. Chow, A. A. Nilsson, and H. J. Trussell, "Classification of Internet traffic using artificial neural networks," 2002. Used the packet size for classification. Showed a neural network was favourable compared to a standardised clustering technique.
[10] J. Quittek, T. Zseby, B. Claise, and S. Zander, "Requirements for IP Flow Information Export (IPFIX)," Internet Engineering Task Force (IETF), October 2004. Used to understand the requirements for IPFIX designation.
[11] A. Ng. Machine Learning. Available: https://www.coursera.org/browse/datascience/machine-learning A free course on machine learning. Was used initially to become familiar with the concepts.