
Classification and Analysis of Computer Network Traffic

Tomasz Bujlow

PhD Thesis, Networking & Security, Department of Electronic Systems, Aalborg University, June 2014, ISBN 978-87-71520-30-9.

  Download this publication in PDF (author's version)


Traffic monitoring and analysis can be performed for many different reasons: to investigate the usage of network resources, assess the performance of network applications, adjust Quality of Service (QoS) policies in the network, log traffic to comply with the law, or create realistic traffic models for academic purposes. The objective of this thesis is to find a way to evaluate the performance of various applications in a high-speed Internet infrastructure. To meet this objective, we needed to answer a number of research questions. Most of them concern techniques for traffic classification that can process large amounts of data in near real time using affordable CPU and memory resources. The remaining questions relate to methods for real-time estimation of the application QoS level based on the results produced by the traffic classifier. This thesis focuses on topics connected with traffic classification and analysis, while the work on methods for QoS assessment is limited to defining their connections with traffic classification and proposing a general algorithm.

We introduced the already known methods for traffic classification (classification by transport-layer port numbers, Deep Packet Inspection (DPI), and statistical classification) and assessed their usefulness in particular areas. We found that classification based on port numbers is no longer accurate, since most applications use dynamic port numbers, while DPI is relatively slow, requires a lot of processing power, and raises serious privacy concerns. Statistical classifiers based on Machine Learning Algorithms (MLAs) were shown to be fast and accurate; at the same time, they do not consume many resources and do not raise privacy concerns. However, they require good-quality training data. We performed substantial testing of widely used DPI classifiers (PACE, OpenDPI, L7-filter, nDPI, Libprotoident, and NBAR) and assessed their usefulness in generating the ground truth that can serve as training data for MLAs. Our evaluation showed that the most accurate classifiers (PACE, nDPI, and Libprotoident) do not provide consistent output; the results are given on a mix of levels: application, content, content container, service provider, or transport-layer protocol. On the other hand, L7-filter and NBAR provide results consistently on the application level; however, their accuracy is too low for them to serve as tools for generating the ground truth. We also contributed to the open-source community by improving the accuracy of nDPI and designing future enhancements to make its classification consistent.
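The weakness of port-based classification described above can be illustrated with a toy sketch. The port table and function below are purely hypothetical and not part of the thesis; they only show why dynamic ports defeat this method.

```python
# Illustrative only: a naive port-based classifier with a tiny,
# hypothetical port-to-application table.
WELL_KNOWN_PORTS = {80: "HTTP", 443: "HTTPS", 53: "DNS", 25: "SMTP"}

def classify_by_port(src_port: int, dst_port: int) -> str:
    """Return an application name if either endpoint uses a well-known port."""
    for port in (dst_port, src_port):
        if port in WELL_KNOWN_PORTS:
            return WELL_KNOWN_PORTS[port]
    # Applications on dynamic/ephemeral ports (e.g. P2P) end up here.
    return "unknown"

print(classify_by_port(52314, 443))    # HTTPS
print(classify_by_port(52314, 51002))  # unknown
```

Any application that negotiates random high ports is invisible to such a lookup, which is why the thesis turns to statistical classifiers instead.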

Because the existing methods were shown to be incapable of generating proper training data, we built our own host-based system for collecting and labeling network data. It depends on volunteers and was therefore named the Volunteer-Based System (VBS). The client registers information about all packets transferred through any network interface of the machine on which it is installed. The packets are grouped into flows, which are labeled with the name of the process obtained from the system sockets. The detailed statistics about the network flows give an overview of how the network is utilized. The data collected by VBS can be used to create realistic traffic profiles of selected applications, which can serve as training data for MLAs.
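The flow-grouping step can be sketched as follows. This is a minimal illustration, not VBS code: the packet record fields (`src_ip`, `proto`, `process`, …) are hypothetical names standing in for whatever the client actually captures.

```python
# Minimal sketch of grouping captured packets into bidirectional flows
# keyed by the 5-tuple, then labeling each flow with the owning process.
from collections import defaultdict

def group_into_flows(packets):
    """Group packet records by a direction-independent 5-tuple key."""
    flows = defaultdict(list)
    for pkt in packets:
        # Sort the endpoints so both directions map to the same flow.
        endpoints = tuple(sorted([(pkt["src_ip"], pkt["src_port"]),
                                  (pkt["dst_ip"], pkt["dst_port"])]))
        flows[(pkt["proto"], endpoints)].append(pkt)
    return flows

packets = [
    {"src_ip": "10.0.0.2", "src_port": 51000, "dst_ip": "1.2.3.4",
     "dst_port": 443, "proto": "TCP", "size": 120, "process": "firefox"},
    {"src_ip": "1.2.3.4", "src_port": 443, "dst_ip": "10.0.0.2",
     "dst_port": 51000, "proto": "TCP", "size": 1400, "process": "firefox"},
]
flows = group_into_flows(packets)
for key, pkts in flows.items():
    label = pkts[0]["process"]  # label taken from the system socket owner
    print(key, len(pkts), "packets ->", label)
```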

We assessed the usefulness of the C5.0 Machine Learning Algorithm (MLA) in the classification of computer network traffic. We showed that the application-layer payload is not needed to train a C5.0 classifier that distinguishes different applications accurately: statistics based on the information accessible in the packet headers and on the packet sizes are fully sufficient to obtain high accuracy. We also contributed by defining sets of classification attributes for C5.0 and by testing various classification modes (decision trees, rulesets, boosting, softening thresholds) with respect to classification accuracy and the time required to create the classifier.
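To make the idea of header-only attributes concrete, the sketch below derives a few flow statistics of the kind a C5.0-style classifier could consume. The specific attribute set here is an illustration in the spirit of the thesis, not its actual attribute definitions.

```python
# Illustrative flow-attribute extraction using only header-derived data
# (packet sizes and timestamps) -- no payload inspection.
import statistics

def flow_attributes(packet_sizes, timestamps, first_n=5):
    """Build a feature dictionary for one flow (example attributes only)."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    attrs = {
        "num_packets": len(packet_sizes),
        "mean_size": statistics.mean(packet_sizes),
        "stdev_size": statistics.pstdev(packet_sizes),
        "mean_gap": statistics.mean(gaps) if gaps else 0.0,
    }
    # The sizes of the first few packets often discriminate applications.
    for i in range(first_n):
        attrs[f"size_{i}"] = packet_sizes[i] if i < len(packet_sizes) else 0
    return attrs

attrs = flow_attributes([74, 1400, 1400, 90], [0.0, 0.01, 0.02, 0.5])
print(attrs["num_packets"], attrs["mean_size"], attrs["size_1"])
```

Feature vectors of this shape, one per labeled flow, are what a decision-tree learner such as C5.0 is trained on.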

We showed how to use our VBS tool to obtain per-flow, per-application, and per-content statistics of traffic in computer networks. Furthermore, we created two datasets composed of various applications, which can be used to assess the accuracy of different traffic classification tools. The datasets contain full packet payloads and are available to the research community as a set of PCAP files together with their per-flow descriptions in corresponding text files. The included flows were labeled by VBS.

We also designed and implemented our own system for multilevel traffic classification, which provides consistent results on all six levels: Ethernet, IP protocol, application, behavior, content, and service provider. The Ethernet and IP protocol levels are identified directly from the corresponding header fields. The application and behavior levels are assessed by a statistical classifier based on the C5.0 Machine Learning Algorithm. Finally, the content and service provider levels are identified based on IP addresses. The system is able to deal with unknown traffic by leaving it unclassified on all levels, instead of assigning it to the closest matching class. Our system was implemented in Java and released as an open-source project.
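The multilevel idea, including the explicit "unknown" outcome, can be sketched as below. The provider table, field names, and classifier stub are hypothetical; the thesis system itself is written in Java and uses a trained C5.0 model where the stub appears here.

```python
# Sketch of multilevel labeling with an explicit "unknown" fallback.
import ipaddress

# Hypothetical IP-range-to-provider table, illustrative only.
PROVIDER_BY_NET = {"173.194.0.0/16": "Google"}

def provider_for(ip: str) -> str:
    """Service-provider level: resolved from the destination IP address."""
    addr = ipaddress.ip_address(ip)
    for net, name in PROVIDER_BY_NET.items():
        if addr in ipaddress.ip_network(net):
            return name
    return "unknown"

def classify_flow(flow, app_classifier):
    """Return labels on several levels; unidentified traffic stays unknown."""
    return {
        "ip_protocol": flow["proto"],         # read directly from the header
        "application": app_classifier(flow),  # statistical (C5.0-like) step
        "service_provider": provider_for(flow["dst_ip"]),
    }

stub = lambda flow: "unknown"  # stand-in for the trained classifier
result = classify_flow({"proto": "TCP", "dst_ip": "173.194.10.5"}, stub)
print(result)
```

Note that the application level is left as "unknown" rather than forced into the nearest class, mirroring the system's handling of unknown traffic.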

Finally, we created a method for assessing the Quality of Service in computer networks. The method relies on VBS clients installed on the machines of a representative group of users in a particular network. The per-application traffic profiles obtained from the volunteers' machines are used to train a C5.0-based tool to recognize the selected applications at any point in the network. Once an application is identified, the quality of the application session can be assessed. For that purpose, we proposed a hybrid method combining passive and active approaches: the passive approach can be used to assess jitter, burstiness, and download and upload speeds, while the active one is needed to measure delay or packet loss.
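As a rough illustration of the passive side, the sketch below derives two of the mentioned metrics from observed packet arrival times and sizes. The jitter estimator applies RFC 3550-style exponential smoothing to inter-arrival gap variation; this particular formulation is an assumption for illustration, not the thesis's exact method.

```python
# Passive-measurement sketch: estimate jitter and throughput from
# packet arrival timestamps (seconds) and sizes (bytes).
def passive_metrics(timestamps, sizes):
    """Return (smoothed jitter in s, average speed in bytes/s)."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    jitter = 0.0
    for i in range(1, len(gaps)):
        d = abs(gaps[i] - gaps[i - 1])
        jitter += (d - jitter) / 16.0  # RFC 3550-style 1/16 smoothing gain
    duration = timestamps[-1] - timestamps[0]
    speed = sum(sizes) / duration if duration > 0 else 0.0
    return jitter, speed

jitter, speed = passive_metrics([0.0, 0.1, 0.2, 0.4],
                                [1000, 1000, 1000, 1000])
print(round(speed), "bytes/s")  # 4000 bytes over 0.4 s
```

Delay and packet loss, by contrast, require injecting probe traffic, which is why the method pairs this passive part with an active one.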
