
Study of P2P Flow Measurement and Identification Method (P2P流的测量与识别方法研究)

【Author】 Liu Bin (柳斌)

【Advisor】 Li Zhitang (李芝棠)

【Author information】 Huazhong University of Science and Technology, Computer System Architecture, 2008, PhD

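【Abstract (translated from the Chinese original)】 P2P (peer-to-peer) is a new network application model whose defining characteristic is that a P2P network relies on nodes at the network edge, rather than central nodes, for self-organization and resource sharing. In recent years P2P technology has been widely applied to file sharing, voice services, streaming media, instant messaging, and other areas. Its rapid growth has also created new problems for network management: P2P applications consume large amounts of bandwidth and raise copyright disputes and security issues. P2P applications use dynamic ports, packet encryption, and other techniques to evade network monitoring, so traditional port-based identification is no longer effective for P2P flows. Effective P2P flow identification has therefore become an important topic in P2P flow management. This thesis studies P2P flow measurement and identification from four angles: measurement of a typical P2P system, heuristic identification, discovery of unknown P2P flows, and machine learning.

BitTorrent is a typical representative of the hybrid P2P systems in wide use today. It is studied from three angles: active measurement of the BitTorrent protocol, passive measurement, and modeling of BitTorrent flows. First, an active measurement method is proposed: message-measurement and state-measurement modules are inserted into a BitTorrent client to collect the messages it sends and receives and its state changes during a download, so that the download process is observed from inside the protocol. The measurements show that a downloading BitTorrent peer obtains most of its data from a small number of peers in the swarm, uploads mainly to a very small number of peers, and that the peers it downloads the most from are usually also the peers it uploads the most to. For passive measurement, a real-time BitTorrent flow measurement method based on application-layer signatures is proposed. It uses a flow-matching measurement framework that matches at the granularity of flows, with a hash algorithm based on the XOR operation designed for flow matching. Identifying BitTorrent packets by application-layer signatures gives high accuracy. False-positive and false-negative models of the measurement algorithm are established, the relationship between signature packets and flow length is analyzed, and it is proposed that attention should focus on the application-layer signatures of long flows. Distribution models are proposed for BitTorrent flow length and flow inter-arrival time: the inter-arrival time can be described by a Weibull distribution and the flow length by a Lognormal distribution.

Heuristic P2P flow identification is studied. A P2P host identification algorithm based on multiple behavioral features is proposed. Based on an analysis of the connection patterns of P2P nodes, the distribution of remote addresses, and port behavior, three features are extracted: the bidirectional connection rate, an IP address randomness measure, and the high-port connection rate. P2P flows are then identified by threshold classification. Experiments show that the algorithm has a low false-positive rate. In addition, an application-level P2P flow classification method based on support vector machines is proposed, which exploits the good classification performance of support vector machines to classify P2P flows of different application types. Classification experiments on four kinds of P2P flow (BitTorrent, Emule, PPLive, and PPstream) verify the method's effectiveness, with an average classification accuracy of 92.2%.

Discovery of unknown P2P applications is studied. First, a flow analysis method based on a multi-dimensional clustering tree (Multi-dimensional Clustering Tree, MCT) is proposed. The method first clusters each dimension of the flow data separately to find significant one-dimensional clusters, then builds a multi-dimensional clustering tree to find significant multi-dimensional clusters. MCT automatically mines the significant flows in a network, describes their multi-dimensional attributes, and also reveals the IP subnets with significant traffic. Building on MCT, a method for identifying unknown P2P flows is proposed. Using the remote address distribution, bidirectionality, and high-port characteristics of P2P flows, a P2P suspicion index sp2p is defined and used to rate the multi-dimensional significant flows mined by MCT. For flows with a high suspicion score, application-layer signature matching removes the known P2P flows, leaving the unknown P2P flows. Experimental results show that MCT gives a clear picture of the composition of network traffic and that sp2p effectively identifies multiple high-volume P2P applications in the network.

Machine learning is applied to application flow identification. A cascaded feature selection algorithm based on an entropy function is proposed. It first uses the posterior probability distribution of each feature to measure the feature's usefulness for classification, then uses sequential backward search, with the classifier's own classification accuracy as the evaluation criterion, to remove redundant features. This method selects 11 classification features from the 249 features of the Andrew Moore dataset. A semi-supervised clustering method for application flow classification is also proposed. It first clusters the mixed data with a K-means algorithm optimized by particle swarm optimization, then uses a small amount of labeled data to determine the mapping between clusters and application types, thereby classifying application flows. Experiments on the Andrew Moore dataset show that the method achieves high flow identification accuracy.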
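The multi-behavior host heuristic described in the abstract above (three features followed by threshold classification) can be sketched roughly as follows. The abstract does not give the feature formulas, the thresholds, or the rule for combining them, so everything concrete here is an assumption made for illustration: the function names `behaviour_metrics` and `looks_like_p2p`, the min/max balance used for the bidirectional rate, the normalized entropy over the first address octet, the 1024 port boundary, and the threshold values.

```python
import math
from collections import Counter


def behaviour_metrics(host, connections, high_port=1024):
    """Return (bidirectional_rate, address_entropy, high_port_rate) for one host.

    connections: list of (src_ip, src_port, dst_ip, dst_port) tuples in which
    `host` is either the source or the destination. All three definitions are
    illustrative assumptions, not the formulas used in the thesis.
    """
    total = len(connections)
    if total == 0:
        return 0.0, 0.0, 0.0
    incoming = outgoing = high = 0
    prefixes = Counter()
    for src_ip, src_port, dst_ip, dst_port in connections:
        if src_ip == host:
            outgoing += 1
            remote = dst_ip
        else:
            incoming += 1
            remote = src_ip
        prefixes[remote.split(".")[0]] += 1  # first octet of an IPv4 address
        if src_port >= high_port and dst_port >= high_port:
            high += 1
    # Balance between accepted and initiated connections (1.0 = fully balanced).
    bidirectional = min(incoming, outgoing) / max(incoming, outgoing)
    # Shannon entropy of the remote prefixes, scaled by log2(total) when total > 1.
    entropy = -sum((n / total) * math.log2(n / total) for n in prefixes.values())
    if total > 1:
        entropy /= math.log2(total)
    return bidirectional, entropy, high / total


def looks_like_p2p(metrics, thresholds=(0.3, 0.5, 0.5)):
    """Assumed rule: flag the host only if every metric reaches its threshold."""
    return all(m >= t for m, t in zip(metrics, thresholds))


# Hypothetical traffic: one P2P-like host and one plain web client.
HOST = "10.0.0.5"
p2p_like = [(HOST, 50000 + i, f"{20 + i}.1.1.1", 6881 + i) for i in range(4)]
p2p_like += [(f"{60 + i}.2.2.2", 40000 + i, HOST, 51413) for i in range(4)]
web_like = [(HOST, 50000 + i, "93.184.216.34", 80) for i in range(8)]

print(looks_like_p2p(behaviour_metrics(HOST, p2p_like)))
print(looks_like_p2p(behaviour_metrics(HOST, web_like)))
```

Requiring all three metrics to pass is one plausible way to keep false positives low, which is the property the abstract reports. A real deployment would tune the thresholds on labeled traces.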

【Abstract】 P2P (peer-to-peer) is a new model of network application, characterized by reliance on nodes at the edge of the network, rather than central nodes, for self-organization and resource sharing. P2P networks are typically used for file sharing, media streaming, instant messaging, and similar services. P2P has developed rapidly in recent years, but it has also brought new problems for network management, such as heavy bandwidth consumption and network security risks. Because most P2P applications use dynamic, random port numbers and data encryption, traditional port matching has become ineffective for P2P flow identification. P2P flow identification has therefore become an important problem in P2P flow management.

This thesis studies P2P flow measurement and identification in four areas: measurement of a typical P2P system, heuristic identification, discovery of unknown P2P applications, and machine learning.

BitTorrent is a recent yet successful P2P protocol focused on efficient content delivery. To better understand the BitTorrent protocol, an active measurement system built on a modified BitTorrent client is designed. It records detailed information on all exchanged messages and protocol events. Experiments show that the peers from which the local peer downloads the most are also the peers that receive the most uploaded bytes. For passive measurement, a BitTorrent measurement method using application-layer signatures is presented. The measurement framework has two parts: connection tracking and application-layer signature matching. A hash algorithm based on the XOR operation is provided for connection tracking. By matching BitTorrent application-layer signatures, the method identifies BitTorrent flows accurately. The flow length and flow inter-arrival characteristics of BitTorrent are analyzed.
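A minimal sketch of the two-part passive framework described above (connection tracking with an XOR-based hash, then application-layer signature matching) might look like this. The abstract does not specify the hash layout or which signatures are matched, so the bit layout in `flow_hash`, the `FlowTable` class, and the table size are all assumptions. The only protocol fact used is the standard BitTorrent handshake prefix (the byte 19 followed by the string "BitTorrent protocol").

```python
import ipaddress

# Standard BitTorrent handshake prefix: length byte 19 + protocol string.
# Assumption: the thesis may match other or additional signatures.
BT_HANDSHAKE = b"\x13BitTorrent protocol"


def flow_hash(src_ip, src_port, dst_ip, dst_port, proto, table_size=65536):
    """Direction-independent bucket index for a 5-tuple (illustrative layout)."""
    a = int(ipaddress.ip_address(src_ip))
    b = int(ipaddress.ip_address(dst_ip))
    # XOR is commutative, so swapping the two endpoints gives the same value.
    h = (a ^ b) ^ ((src_port ^ dst_port) << 16) ^ proto
    h ^= h >> 16  # fold the high bits into the low bits
    return h % table_size


def canonical_key(src_ip, src_port, dst_ip, dst_port, proto):
    """Order the endpoints so both directions of a flow share one key."""
    first, second = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return (first, second, proto)


class FlowTable:
    """Hypothetical flow table: bucket index -> {flow key: is_bittorrent}."""

    def __init__(self, size=65536):
        self.size = size
        self.buckets = {}

    def observe(self, src_ip, src_port, dst_ip, dst_port, proto, payload):
        """Record one packet and return True if its flow is marked BitTorrent."""
        idx = flow_hash(src_ip, src_port, dst_ip, dst_port, proto, self.size)
        key = canonical_key(src_ip, src_port, dst_ip, dst_port, proto)
        bucket = self.buckets.setdefault(idx, {})
        seen = bucket.get(key, False)
        # Once a handshake is seen, every later packet of the flow is labeled.
        bucket[key] = seen or payload.startswith(BT_HANDSHAKE)
        return bucket[key]


table = FlowTable()
handshake = BT_HANDSHAKE + bytes(48)  # reserved bytes, info hash, peer id
print(table.observe("10.0.0.1", 51413, "192.168.1.7", 6881, 6, handshake))
print(table.observe("192.168.1.7", 6881, "10.0.0.1", 51413, 6, b"\x00\x00\x00\x01\x02"))
```

Because the hash is symmetric, packets in both directions land in the same bucket, so a flow only needs to be matched against signatures once.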
It is found that the inter-arrival time of BitTorrent flows follows a Weibull distribution and that their length follows a Lognormal distribution.

Heuristic methods for identifying P2P applications are also studied. BEH, a P2P host identification algorithm based on multiple behavioral characteristics, is proposed. Several behaviors inherent to P2P flows are first explored and then translated into metrics: the ratio of incoming to outgoing connections, the entropy of remote hosts' IP addresses, and the use of high ports. BEH, which combines the three metrics, showed a low false-positive rate in experiments. A P2P flow classification method based on the support vector machine is also proposed. The study focuses on four P2P applications: BitTorrent, Emule, PPLive, and PPstream. The experimental results confirm the validity of the method, with an average classification accuracy of 92.2%.

A new flow analysis method called MCT, based on a multi-dimensional clustering tree, is proposed. Each dimension of the flow data is first clustered hierarchically to identify the dominant flows. After the significant one-dimensional rules are mined, a multi-dimensional clustering tree combines them to find significant multi-dimensional rules. An unknown-P2P identification method based on MCT is presented. A metric, Sp2p, is defined from the entropy of IP addresses, IP prefixes, and the two-way property of P2P flows, and is used to identify P2P flows. The results show that multi-dimensional flow mining gives a clear picture of the composition of current network traffic. Moreover, the system is able to identify a variety of P2P flows that account for a large proportion of total traffic.

Machine learning techniques provide a promising way to classify flows by application protocol. A two-phase combined feature selection algorithm called ESBS is designed.
In the first phase, an entropy method is used to filter out irrelevant features. In the second phase, a sequential backward search removes redundant features, guided by the performance of the induction algorithm. Using ESBS, 11 features are selected from the 249 features of the Andrew Moore datasets. A semi-supervised clustering method called PSOSC is proposed for application flow classification. First, a novel K-means clustering algorithm based on Particle Swarm Optimization is presented for data with a few labeled and many unlabeled flows. Then, using the few labeled flows, clusters are mapped to applications. Experimental evaluation on the Andrew Moore datasets shows that high flow classification accuracy can be achieved.
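The semi-supervised step (cluster all flows, then use the few labeled flows to map clusters to applications) can be illustrated with a small sketch. Several things here are assumptions: plain Lloyd-style K-means with fixed initial centroids stands in for the PSO-optimized K-means of PSOSC, majority voting is assumed as the mapping rule because the abstract does not state one, and the two features and all data values are hypothetical.

```python
from collections import Counter, defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(point, centroids[k])),
    )


def kmeans(points, init_centroids, iterations=20):
    """Plain K-means; a stand-in for the PSO-optimized variant in the thesis."""
    centroids = [tuple(c) for c in init_centroids]
    for _ in range(iterations):
        assign = [nearest(p, centroids) for p in points]
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids


def map_clusters(points, labels, centroids):
    """Assumed rule: each cluster takes the majority label of its labeled flows."""
    votes = defaultdict(Counter)
    for point, label in zip(points, labels):
        if label is not None:
            votes[nearest(point, centroids)][label] += 1
    return {k: counts.most_common(1)[0][0] for k, counts in votes.items()}


def classify(point, centroids, mapping):
    """Label a new flow by the application mapped to its nearest cluster."""
    return mapping.get(nearest(point, centroids), "unknown")


# Hypothetical flows: (mean packet size in bytes, duration in seconds).
flows = [(100, 1), (110, 2), (105, 1.5), (1400, 60), (1350, 55), (1420, 65)]
labels = ["WWW", None, None, "P2P", None, None]  # only two flows are labeled

centroids = kmeans(flows, [flows[0], flows[3]])
mapping = map_clusters(flows, labels, centroids)
print(classify((120, 2), centroids, mapping))
print(classify((1300, 50), centroids, mapping))
```

Only two of the six flows carry labels, yet every cluster receives an application name, which is the appeal of the semi-supervised approach when labeled traces are scarce.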
