Return to the home page of Tomasz Bujlow
Technical Report, Department of Electronic Systems, Aalborg University, February 2014.
Download this publication in PDF (author's version)
Abstract
Classifying and accounting for computer network traffic is an important task for Internet Service Providers (ISPs), as it allows them to adjust bandwidth and network policies and to provide a better experience to their customers. However, existing traffic classification tools cannot identify traffic in a consistent manner. Results are usually given on different levels for different flows: for some flows only the application is identified (such as HTTP, BitTorrent, or Skype), for others only the content (such as audio or video) or the content container (such as Flash), and for yet others only the service provider (such as Facebook, YouTube, or Google). Furthermore, Deep Packet Inspection (DPI), which seems to be the most accurate technique, requires extensive resources and often cannot be used by ISPs in their networks for privacy or legal reasons. Techniques based on Machine Learning Algorithms (MLAs) require good-quality training data, which are difficult to obtain. MLAs usually cannot properly handle types of traffic other than those they were trained on: such traffic is assigned to the most probable class instead of being left unclassified. Another drawback of MLAs is their inability to detect the content carried by a flow or its service provider.
To overcome the drawbacks of existing methods, we developed a novel hybrid method that provides accurate identification of computer network traffic on six levels: Ethernet, IP protocol, application, behavior, content, and service provider. The Ethernet and IP protocol levels are identified directly from the corresponding header fields (EtherType in Ethernet frames and Protocol in IP packets). The application and behavior levels are assessed by a statistical classifier based on the C5.0 Machine Learning Algorithm. Finally, the content and service provider levels are identified based on IP addresses. The training data for the statistical classifier and the mappings between content types and IP addresses are created from data collected by the Volunteer-Based System, while the mappings between service providers and IP addresses are created from captured DNS replies. The system has built-in support for the following applications: America's Army, BitTorrent, DHCP, DNS, various file downloaders, eDonkey, FTP, HTTP, HTTPS, NETBIOS, NTP, RDP, RTMP, Skype, SSH, and Telnet. Within each application group, we identify a number of behaviors; for HTTP, for example, these are file transfer, web browsing, web radio, and unknown. The system built on this method also provides traffic accounting, and it was tested on two datasets.
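The two header-based levels described above can be illustrated with a minimal Java sketch. It is not taken from the released system: the class name `HeaderLevels`, its method names, and the small label tables are hypothetical, and the sketch assumes untagged Ethernet II frames (no VLAN tag) carrying IPv4. For IPv6 the corresponding field would be Next Header, which is not handled here.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch: reads the Ethernet and IP protocol levels from a raw frame. */
final class HeaderLevels {

    private static final int ETHERNET_HEADER_LENGTH = 14;
    private static final int MIN_IPV4_HEADER_LENGTH = 20;
    private static final int ETHERTYPE_IPV4 = 0x0800;

    private HeaderLevels() {}

    /** Returns the EtherType (bytes 12-13) of an untagged Ethernet II frame. */
    static int etherType(byte[] frame) {
        if (frame.length < ETHERNET_HEADER_LENGTH) {
            throw new IllegalArgumentException("frame shorter than an Ethernet header");
        }
        // ByteBuffer is big-endian by default, which matches network byte order.
        return ByteBuffer.wrap(frame, 12, 2).getShort() & 0xFFFF;
    }

    /** Returns the IPv4 Protocol field (byte 9 of the IP header), or -1 if not IPv4. */
    static int ipProtocol(byte[] frame) {
        if (etherType(frame) != ETHERTYPE_IPV4
                || frame.length < ETHERNET_HEADER_LENGTH + MIN_IPV4_HEADER_LENGTH) {
            return -1;
        }
        return frame[ETHERNET_HEADER_LENGTH + 9] & 0xFF;
    }

    /** Maps a few well-known EtherType values to labels (illustrative subset). */
    static String ethernetLabel(int etherType) {
        switch (etherType) {
            case 0x0800: return "IPv4";
            case 0x0806: return "ARP";
            case 0x86DD: return "IPv6";
            default: return "unknown";
        }
    }

    /** Maps a few well-known IP protocol numbers to labels (illustrative subset). */
    static String ipProtocolLabel(int protocol) {
        switch (protocol) {
            case 1: return "ICMP";
            case 6: return "TCP";
            case 17: return "UDP";
            default: return "unknown";
        }
    }
}
```

Because both values are read straight from fixed header offsets, these two levels involve no statistical inference, which is consistent with their being the simplest of the six levels to identify.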
The classification results are as follows. On the Ethernet and IP protocol levels, we achieved an error rate of 0.00%. Classification on the application and behavior levels was assessed jointly. On the first dataset, we obtained 0.08% errors, while 0.54% of flows remained unknown. On the second dataset, we obtained 0.09% errors, while 0.75% of flows remained unknown. When the content level was also taken into account, classification on the first dataset gave 0.22% errors and 0.47% unclassified flows, while on the second dataset it gave 0.96% errors and 1.42% unclassified flows. Classification on the service provider level was performed only on the first dataset (we needed the application-layer payloads), and it gave 1.34% errors and 1.71% unknown flows. We have therefore shown that our system gives consistent, accurate output on all levels. We also showed that, on the application level, our system outperformed the most commonly used DPI tools. Finally, our system was implemented in Java and released as an open-source project.
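The service provider level relies on mappings built from captured DNS replies, which is presumably why it needed the application-layer payloads. The following Java sketch shows one way such a mapping could work; it is an assumption for illustration, not the released implementation. The class `ProviderMap`, its methods, and the rule of taking the last two labels of the queried name as the provider are all hypothetical, and that rule is deliberately naive (it would mislabel names under suffixes such as `co.uk`). Flows whose remote address never appeared in a DNS reply are left as unknown rather than forced into a class.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: maps IP addresses to service providers using DNS answers. */
final class ProviderMap {

    private final Map<String, String> ipToProvider = new HashMap<>();

    /** Records one address answer (queried name and resolved IP) from a DNS reply. */
    void addDnsAnswer(String queriedName, String resolvedIp) {
        ipToProvider.put(resolvedIp, providerOf(queriedName));
    }

    /** Returns the provider for a flow's remote IP, or "unknown" if never resolved. */
    String lookup(String remoteIp) {
        return ipToProvider.getOrDefault(remoteIp, "unknown");
    }

    /** Naive assumption: the last two labels of the name identify the provider. */
    static String providerOf(String name) {
        // split() drops trailing empty strings, so a final root dot is ignored.
        String[] labels = name.toLowerCase(Locale.ROOT).split("\\.");
        if (labels.length < 2) {
            return labels.length == 0 ? "" : labels[0];
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }
}
```

A flow to an address previously seen in a reply for `www.youtube.com` would be labeled `youtube.com` under this rule, while a flow to an address with no matching reply would stay unknown.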