Projects:2014S1-10 Development of Machine Learning Techniques for Analysing Network Communications
Project information
This project involves research and development of machine learning techniques for application in the characterisation of communication networks and their traffic. As networks become larger and more diverse, it is no longer feasible to examine every data packet in order to perform network monitoring (or characterisation) functions. Instead, in large enterprise networks, network data collected at various access points (e.g. routers) is summarized into 'flows-based' (i.e. NetFlow/IPFIX) traffic datasets. These 'flows' of network communications capture vital statistics of the types of network traffic from one IP-based device to another.
By devising new data mining methodologies and adapting existing machine learning techniques, students shall investigate how to extract useful information from these flows of traffic data. In particular, what characteristics of the network can be 'learned' using machine learning? Two questions are of interest:
- Traffic classification: what type of traffic (http, https, mail, ftp, gaming, malicious, etc.) was transacted between any two devices within (or external to) the network?
- Device characterisation: what type of devices (DNS server, mail server, online-game server, botnet slave/master, etc.) can be identified throughout the network?
This problem is further complicated by ephemeral ports and non-standard usage of well-known ports by various types of network traffic.
Depending on the chosen number of students, the range of potential machine learning techniques may include tree/forest heuristics, Bayesian-network, or support-vector-machine based supervised learning, or clustering-oriented unsupervised learning (or some combination of both methods). Students shall prototype their techniques on both real-life network datasets and data extracted from a traffic generator using networks artificially conceived by the students themselves.
The end-goal of the machine learning prototypes is to facilitate further characterisation and analysis of enterprise IP-based networks for network monitoring, management, optimization, and cyber malicious/anomaly detection use cases.
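As a rough illustration of how packets are summarised into flow records, the sketch below groups packets by the usual 5-tuple and accumulates packet count, byte count, and first/last timestamps. The `Packet` and `Flow` types and their field names are hypothetical and do not follow the NetFlow/IPFIX record layout; real exporters also apply active/inactive timeouts, which are omitted here.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


# Hypothetical, simplified packet record (not a real capture format).
@dataclass(frozen=True)
class Packet:
    ts: float      # timestamp in seconds
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: int     # IP protocol number, e.g. 6 = TCP, 17 = UDP
    size: int      # packet size in bytes


# Hypothetical flow summary holding only a few basic statistics.
@dataclass
class Flow:
    packet_count: int = 0
    byte_count: int = 0
    first_ts: float = 0.0
    last_ts: float = 0.0


FlowKey = Tuple[str, str, int, int, int]


def aggregate_flows(packets: Iterable[Packet]) -> Dict[FlowKey, Flow]:
    """Summarise packets into unidirectional flows keyed by the 5-tuple."""
    flows: Dict[FlowKey, Flow] = {}
    for p in packets:
        key = (p.src_ip, p.dst_ip, p.src_port, p.dst_port, p.proto)
        flow = flows.get(key)
        if flow is None:
            # First packet of this flow: initialise both timestamps.
            flow = Flow(first_ts=p.ts, last_ts=p.ts)
            flows[key] = flow
        flow.packet_count += 1
        flow.byte_count += p.size
        flow.first_ts = min(flow.first_ts, p.ts)
        flow.last_ts = max(flow.last_ts, p.ts)
    return flows
```

Because the key is direction-sensitive, a request and its reply end up in two separate flows; whether to merge them into a bidirectional flow is a design choice left open here.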
Outline of proposed work
- Gain familiarity with using the BreakingPoint traffic generator to create datasets for anomalous traffic detection (e.g. injection of malicious traffic) (conducted by all students?).
- Compare all machine learning approaches using sub-sampled datasets, e.g. start with complete traffic flow datasets and then investigate the impact that sub-sampling the traffic flow summaries in the ratios of 1:10, 1:100 and 1:1000 has on detection and classification false positives, etc. (conducted by all students).
- Examine hybrid ML methods, boosting/bagging, semi-supervised schemes with ‘seeding’ and clustering. In contrast, what if a simplistic method/model based on port numbers only is applied?
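The sub-sampling ratios and the port-only baseline mentioned in the list above can be sketched as follows. This is a minimal illustration, assuming the flow records are held in a Python sequence; systematic 1-in-N selection is only one possible sampling scheme, and the `WELL_KNOWN_PORTS` table is a small hypothetical subset chosen for the example.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def subsample(records: Sequence[T], n: int, offset: int = 0) -> List[T]:
    """Systematic 1-in-n sampling: keep every n-th record, starting at offset."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return list(records[offset::n])


# Illustrative subset of well-known port assignments (an assumption of this
# sketch, not a complete or authoritative table).
WELL_KNOWN_PORTS = {
    21: "ftp",
    25: "mail",
    53: "dns",
    80: "http",
    443: "https",
}


def classify_by_port(src_port: int, dst_port: int) -> str:
    """Label a flow from its port numbers only, checking the destination first."""
    for port in (dst_port, src_port):
        label = WELL_KNOWN_PORTS.get(port)
        if label is not None:
            return label
    return "unknown"
```

A flow with ephemeral ports at both ends is returned as "unknown", and any traffic sent over port 80 is labelled "http" regardless of its content, which is the weakness of port-only classification that the learned models are meant to address.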
Group M (Ben McAleer & Terry Moschou)
- Evaluate the capability of the BreakingPoint tool to simulate different network topologies (e.g. edge/backbone/etc. type network traffic), and generate flows-based traffic of diverse application types (e.g. web, mail, DNS, anomalous, etc.). The data provided by BreakingPoint shall be used in prototyping machine learning methods for device characterisation and traffic classification. Also investigate how others have used BreakingPoint for similar outcomes (e.g. via online forums, re-usable libraries/IP, or by contacting Ixia/BreakingPoint directly).
- Investigate how representative the synthetic BreakingPoint flow traffic is compared to real traffic. If possible, compare synthetic data with real open-source datasets and captured University network traffic, and determine how feasible it is to reproduce real-world traffic statistics for varying scales of networks using the BreakingPoint tool suite. Perform cross-model evaluations.
- Conduct literature survey of existing network traffic classification (and device characterisation) techniques.
- Using the Weka tool as a machine learning platform, evaluate existing classification/characterisation techniques to assess their effectiveness/performance. The BreakingPoint and other datasets shall be used to cover the range of relevant (i.e. baseline) network scenarios.
- Extending from the above tasks, devise new or adapt existing machine learning techniques to target other types of network traffic and devices (i.e. including and beyond those of http web, DNS, p2p, secure-based, etc. application types and devices). This work will require identifying (and evaluating) new forms of machine learning features, and may initially be based on two papers (by Iliofotou et al.) using network graph based metrics as features for traffic classification.
- Investigate the potential of newly developed or existing methods above for anomalous traffic classification and malicious device identification (e.g. botnet-C2-slave/master devices). Examine the concept of ‘polymorphic blending’ in flows-based traffic, i.e. is it possible (and how effective is it) to disguise malicious traffic as benign traffic behaving normally throughout the network? (Use BreakingPoint as the platform for this?) If ‘polymorphic blending’ is plausible, how can machine learning methods tackle it? This will require a search/survey of existing papers tackling the detection of malicious flows-based traffic using ML, and implementing/extending their techniques.
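To make the idea of graph-based features more concrete, the sketch below treats IP addresses as nodes and observed flows as directed edges, then derives an in-degree and out-degree for each host. This is only an assumption about what such features might look like; it is not the specific metric set used in the papers referenced above, and the input format (an iterable of `(src_ip, dst_ip)` pairs) is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def degree_features(edges: Iterable[Tuple[str, str]]) -> Dict[str, Tuple[int, int]]:
    """Return {host: (in_degree, out_degree)}, counting distinct peers only."""
    out_peers: Dict[str, Set[str]] = defaultdict(set)
    in_peers: Dict[str, Set[str]] = defaultdict(set)
    for src, dst in edges:
        out_peers[src].add(dst)   # src initiated a flow towards dst
        in_peers[dst].add(src)    # dst received a flow from src
    hosts = set(out_peers) | set(in_peers)
    return {h: (len(in_peers[h]), len(out_peers[h])) for h in hosts}
```

A host with a high in-degree and a low out-degree behaves like a server (for example a DNS or mail server), so values of this kind could be supplied to a classifier alongside per-flow statistics for device characterisation.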
Group C (Garrett Cullity & Vivian Cao)
- Search for online repositories of network traffic data. Examine and convert the data into labelled Netflow/IPFIX datasets.
- Conduct a literature survey on the use of support-vector-machine (SVM) techniques for traffic classification (and device characterisation?).
- Devise (or adapt) techniques based on SVMs to perform traffic classification (and extend them to conduct device characterisation). Establish whether SVMs are effective for traffic classification, and what benefits they provide over the methods examined by other students in the group. E.g. are they more resistant to noisy feature sets and less susceptible to over-training, as claimed/suggested by certain papers? How do SVMs compare to other methods such as Random Forest?
- Examine the use of SVM-based methods for anomalous traffic classification and malicious device identification (as described in the final bullet point of Group M's description above).
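When comparing SVMs against other classifiers, per-class error rates are likely to be more informative than overall accuracy, particularly where false positives matter. The sketch below computes recall and false-positive rate for each class. It assumes true and predicted labels are available as parallel sequences of strings, which is a hypothetical interface and not tied to Weka's output format.

```python
from typing import Dict, Sequence


def per_class_rates(
    y_true: Sequence[str], y_pred: Sequence[str]
) -> Dict[str, Dict[str, float]]:
    """Compute one-vs-rest recall and false-positive rate for every class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    rates: Dict[str, Dict[str, float]] = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        tn = len(pairs) - tp - fn - fp
        rates[c] = {
            # Fraction of flows of class c that were correctly labelled.
            "recall": tp / (tp + fn) if (tp + fn) else 0.0,
            # Fraction of flows of other classes wrongly labelled as c.
            "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
        }
    return rates
```

Running the same function over the predictions of each candidate method on a shared test set gives directly comparable figures per traffic class.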
High-Level Plan
- Phase 1:
  - Create/gather datasets for device and traffic classification.
  - Background reading and literature survey.
- Phase 2:
  - Benchmark common/relevant machine-learning methods using data from Phase 1.
- Phase 3:
  - Devise techniques (from those of Phase 2) to target specific devices or traffic flows, with a specific cyber malicious attack in mind.
Team
Group members
- Ms Vivian Cao
- Mr Garrett Cullity
- Mr Benjamin McAleer
- Mr Terry Moschou
Supervisors
- A/Prof Cheng-Chew Lim
- Dr Adriel Cheng
- Dr Hong Gunn Chew
Resources
- Bench 18 in Projects Lab