ABSTRACT: Peer-to-Peer (P2P) applications consume significant network resources, causing congestion that degrades network availability and quality of service. Telecom equipment vendors and Internet Service Providers are therefore interested in efficient ways to classify P2P traffic so that it can be controlled.
In this presentation, we discuss the challenges of applying machine learning and data mining techniques to classify P2P traffic. P2P applications generate streaming data in large volumes. New communities of peers form regularly and existing communities may dissolve, so classifiers must cope with concept drift and update their models incrementally. Our observations also confirmed that the traffic data is class-imbalanced, which biases the models towards the majority class. Moreover, we observed that only about 25% of samples can be labeled as “P2P” or “NonP2P” using a port-based heuristic rule. We expect that even fewer samples can be labeled in the future as more P2P applications use dynamic ports. This calls for techniques that improve the accuracy of traffic classification by exploiting the unlabeled samples.
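To illustrate what a port-based heuristic rule of this kind might look like, here is a minimal Python sketch. The port lists and the function name `label_by_port` are assumptions chosen for illustration (commonly cited default ports), not the rule used in this work.

```python
# Hypothetical port lists for illustration only; the actual rule is not given here.
# Commonly cited default ports: BitTorrent 6881-6889, eDonkey 4662, Gnutella 6346/6347.
P2P_PORTS = set(range(6881, 6890)) | {4662, 6346, 6347}
# Well-known service ports: SSH, SMTP, DNS, HTTP, POP3, IMAP, HTTPS.
NON_P2P_PORTS = {22, 25, 53, 80, 110, 143, 443}


def label_by_port(src_port, dst_port):
    """Return "P2P", "NonP2P", or None when neither port is recognized."""
    ports = {src_port, dst_port}
    if ports & P2P_PORTS:
        return "P2P"
    if ports & NON_P2P_PORTS:
        return "NonP2P"
    return None  # the sample stays unlabeled


if __name__ == "__main__":
    for flow in [(51234, 6881), (51234, 443), (51234, 49152)]:
        print(flow, label_by_port(*flow))
```

A flow whose ports match neither list stays unlabeled, which is the case that becomes more common as applications move to dynamic ports.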
We propose a new technique, the imbalanced Concept-adapting Very Fast Decision Tree (iCVFDT), to address the issue of class-imbalanced data. Applied to a real data set of 3.5 million samples, iCVFDT showed a significant improvement in performance over CVFDT. We also propose an incremental Tri-Training (iTT) algorithm to exploit unlabeled samples. We verified the performance of the algorithm on a real data set with 7.2 million labeled samples and 20.4 million unlabeled samples. We extracted attributes only from the IP header, eliminating the privacy concerns associated with techniques that use deep packet inspection.
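For readers unfamiliar with tri-training, the following minimal Python sketch shows the agreement step of the classic (non-incremental) scheme: an unlabeled sample is pseudo-labeled for one classifier when the other two agree on its class. It is not the iTT algorithm itself, whose incremental details are not described here. The function name `pseudo_labels_for` and the toy classifiers are hypothetical, and the error-rate conditions of the full tri-training procedure are omitted.

```python
def pseudo_labels_for(target, classifiers, unlabeled):
    """Collect (sample, label) pairs for classifiers[target].

    A sample is included only when the two *other* classifiers
    predict the same class for it. Classifiers are modeled as plain
    callables mapping a sample to a class label (an assumption made
    to keep the sketch self-contained).
    """
    first, second = [c for i, c in enumerate(classifiers) if i != target]
    pairs = []
    for x in unlabeled:
        a, b = first(x), second(x)
        if a == b:
            pairs.append((x, a))
    return pairs


if __name__ == "__main__":
    # Toy threshold classifiers over a single numeric attribute.
    toy = [
        lambda x: "P2P",
        lambda x: "P2P" if x > 0 else "NonP2P",
        lambda x: "P2P" if x > 5 else "NonP2P",
    ]
    print(pseudo_labels_for(0, toy, [-1, 3, 7]))
```

In this toy run the sample 3 is skipped because the second and third classifiers disagree on it, so only the samples -1 and 7 are passed to the first classifier as additional training data.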