
Research on Key Identification Method of P2P Traffic

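【Author】 Peng Jianfen (彭建芬)

【Advisor】 Tu Xuyan (涂序彦)

【Author information】 Beijing University of Posts and Telecommunications, Signal and Information Processing, 2011, Ph.D.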

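【Summary】 Network traffic identification is an important task in managing large networks and a major component of lawful interception. With the rapid development and wide adoption of network technology, new P2P applications keep emerging. High resource utilization and decentralized information storage have made P2P technology widely used in file sharing, distributed computing, collaborative systems, and e-commerce. As P2P applications multiply, P2P traffic accounts for a growing share of network traffic; in China it exceeds 70% of total traffic. Accurately identifying P2P application traffic is therefore very important for network planning and design and for QoS guarantees. On the other hand, design flaws in P2P network software make it easy for attackers to launch massive denial-of-service attacks that readily bring down Internet sites. The decentralized storage structure, convenient sharing, and fast routing of P2P networks also favor the spread of Trojans, viruses, and other destructive programs. To keep networks running normally, P2P traffic must be identified quickly and accurately. P2P applications currently use dynamic ports and payload encryption to evade identification algorithms based on ports and on application payload signatures. The traffic identification algorithms most widely studied today are based either on behavioral features or on machine learning. The early fast P2P traffic identification algorithm and the improved heuristic P2P traffic identification algorithm proposed in this thesis belong to the machine-learning-based and behavior-based categories, respectively. The early fast identification algorithm applies supervised machine learning to features extracted from the first few packets of a flow; it achieves high accuracy and is suited to early identification of P2P flows and of specific P2P applications. The improved fast heuristic algorithm exploits the transport-layer differences between P2P and non-P2P flows and can quickly identify P2P flows and some specific popular P2P applications. Finally, the thesis studies the connection characteristics and self-similarity of the TCP flows of P2P application hosts. The main work of this thesis comprises the following:

(1) To identify P2P TCP flows promptly, quickly, and accurately, and thereby support early warning and control of P2P traffic, the thesis proposes an SVM-based early TCP traffic identification algorithm. Based on how the packets of different applications' flows actually arrive, the algorithm uses the payload sizes of the first three data packets of a TCP flow and the server port as flow features, and performs one-against-one multi-class classification with a support vector machine using a Gaussian radial basis function kernel. Comparison and analysis of the experimental results show that, with the extracted features, unbiased training samples, and suitably chosen parameters, the algorithm quickly and effectively identifies WEB, MAIL, and the P2P applications BitTorrent and eMule. The feature values are available without waiting for the flow to end, and feature extraction is simple. Because the features do not involve protocol signatures, the algorithm also applies to encrypted traffic and to traffic that disguises its characteristics.

(2) To reduce model-building time and improve classification accuracy, an early fast P2P traffic identification algorithm based on the C4.5 decision tree is proposed, building on the SVM-based early TCP traffic identification algorithm. Comparison and analysis of the classification results show that, relative to the other two classification algorithms, the C4.5 decision tree classifies with high accuracy and high speed. Using the payload sizes of the first three data packets of a TCP flow and the server port as features, this algorithm therefore quickly and effectively identifies WEB, MAIL, BitTorrent, and eMule traffic.

(3) To improve the accuracy of the heuristic P2P flow identification algorithm proposed by Karagiannis et al., that algorithm is extended with port 4662, a counting principle for valid data flows, the fixed payload size of BitTorrent peer protocol handshake messages, and the packet payload characteristics of Skype flows, yielding an improved fast heuristic P2P traffic identification algorithm. Comparison and analysis of the experimental results show that, when distinguishing P2P from non-P2P flows, a suitably chosen peer threshold allows effective identification of P2P flows and of some of the specific applications behind them.

(4) To identify P2P application hosts, the TCP flows of such hosts are studied in terms of connection characteristics and self-similarity. A host in a P2P system plays a dual role: server and client. Non-P2P systems use the traditional client/server model and connect with a very high success rate. In contrast, because P2P systems are dynamic, a P2P host continually initiates connections to other online hosts to maintain a stable download speed. The parameters related to system dynamics and connection success rate are: the number of SYN packets sent, the number of SYN+ACK packets sent, the number of distinct destination addresses of SYN packets sent, the number of distinct source addresses of SYN+ACK packets received, the number of distinct destination ports of SYN packets sent, and the number of distinct source ports of SYN+ACK packets received. Comparison and analysis of the experimental results show that, when identifying the TCP flows of P2P hosts versus hosts running traditional non-P2P applications, using the last four parameters as traffic features is more effective than using all six. The self-similarity of host traffic is analyzed over time and over behavior; the behavioral analysis shows that after a P2P application host has received a certain number of packets, its packet payload varies very little.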
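The early-identification features named in contributions (1) and (2) can be illustrated with a minimal Python sketch. The `Packet` record and its field names are hypothetical, and two points are assumptions rather than statements from the thesis: that a flow begins with the client's SYN (so its destination port is the server port), and that the "first three data packets" are the first three segments carrying a non-empty payload.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Packet:
    """Hypothetical per-packet record; field names are illustrative only."""
    src_port: int
    dst_port: int
    payload_len: int  # TCP payload bytes (0 for bare SYN/ACK segments)
    syn: bool = False
    ack: bool = False


def early_flow_features(packets: List[Packet], n: int = 3) -> Optional[Tuple[int, ...]]:
    """Return (size_1, ..., size_n, server_port), or None if not yet available."""
    if not packets:
        return None
    first = packets[0]
    # Assumption: the flow starts with the client's SYN, so the SYN's
    # destination port is taken as the server port.
    if not first.syn or first.ack:
        return None
    server_port = first.dst_port
    # Assumption: only payload-carrying segments count as "data packets".
    sizes = [p.payload_len for p in packets if p.payload_len > 0][:n]
    if len(sizes) < n:
        return None  # fewer than n data packets seen so far
    return (*sizes, server_port)
```

The resulting tuple could be passed to any supervised classifier; the sketch stops at feature extraction because the abstract does not give the SVM or C4.5 parameters.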

【Abstract】 Internet traffic identification is a crucial task in managing large networks and a major component of lawful interception. With the rapid development and wide application of network technology, more and more applications based on peer-to-peer (P2P) protocols are appearing. The characteristics of P2P techniques, including high resource utilization and decentralized storage, have accelerated their adoption in file sharing, distributed computation, collaborative systems, and e-commerce. Because large-scale P2P applications occupy a growing share of network bandwidth, more than 70% of all traffic in China, identifying P2P traffic is urgent for network planning and design and for QoS guarantees. Meanwhile, design flaws in P2P applications make it easy for attackers to launch massive denial-of-service attacks that can bring down Internet sites. The inherent characteristics of P2P networks, namely the decentralized storage structure, convenient file sharing, and fast routing mechanism, also facilitate the spread of Trojans, viruses, and other destructive programs. Therefore, to ensure normal network operation, P2P traffic must be identified quickly and accurately. However, popular P2P applications use dynamic ports and encrypted payloads to evade both port-based and signature-based P2P traffic identification. Currently, the state-of-the-art traffic identification techniques are based on either network behavior or machine learning. The early and fast P2P traffic identification method and the improved fast heuristic identification method proposed in this thesis belong to the machine-learning-based and behavior-based categories, respectively.
The early traffic identification algorithm uses the payload sizes of the first three packets of a TCP flow and the server port number as features and classifies traffic by supervised learning. It achieves high accuracy and is therefore suitable for early P2P traffic identification. The improved fast heuristic identification method exploits the transport-layer differences between P2P and non-P2P flows and can quickly identify P2P traffic and some specific popular P2P applications. Finally, the connection characteristics and self-similarity of the TCP traffic of P2P application hosts are analyzed. The main contributions of this thesis are as follows: 1. To identify P2P traffic quickly, accurately, and as early as possible, an early TCP traffic identification method based on support vector machines (SVM) is proposed for early warning and control of P2P traffic. The method uses the payload sizes of the first three packets of a TCP flow and the server port number as flow features, and classifies traffic with an SVM using a Gaussian radial basis function kernel and a one-against-one multi-class strategy. Analysis and comparison of the experimental results show that, with the extracted features, unbiased training samples, and suitably chosen parameters, the method efficiently classifies Internet traffic into the WEB, MAIL, BitTorrent, and eMule categories. The features are available without waiting for the flow to end and do not depend on protocol signatures, so the method is also suitable for early identification of encrypted or disguised traffic. 2. To reduce modeling time and improve classification accuracy, an early and fast P2P traffic identification method based on the C4.5 decision tree is proposed, building on the SVM-based method.
Analysis and comparison of the experimental results show that, compared with two other supervised machine learning algorithms, the C4.5 decision tree achieves higher accuracy and faster classification. Therefore, using the payload sizes of the first three packets of a TCP flow and the server port number as flow features, the method can quickly and effectively identify WEB, MAIL, BitTorrent, and eMule traffic. 3. To improve the accuracy and efficiency of the transport-layer P2P traffic identification method proposed by Karagiannis et al., the method is extended with port 4662, a counting mechanism for valid data flows, the fixed payload size of BitTorrent peer protocol handshake messages, and the payload characteristics of Skype packets, yielding an improved fast heuristic P2P traffic identification method. Analysis and comparison of the experimental results show that, with a suitably chosen peer threshold, the improved method is more accurate and efficient. It can identify P2P traffic and specific P2P applications such as BitTorrent, eDonkey, and Skype. 4. To identify P2P hosts, we study the connection characteristics and self-similarity of host TCP traffic. A P2P host acts as both server and client. Non-P2P systems use the traditional client/server model and connect with a high success rate. In contrast, because P2P systems are dynamic, a P2P host constantly initiates connections to other online hosts to maintain a stable download speed. The parameters associated with system dynamics and connection success rate are: the number of transmitted SYN packets, the number of transmitted SYN/ACK packets, the number of distinct destination IPs of transmitted SYN packets, the number of distinct source IPs of received SYN/ACK packets, the number of distinct destination ports of transmitted SYN packets, and the number of distinct source ports of received SYN/ACK packets.
Analysis and comparison of the experimental results show that, for distinguishing the TCP flows of P2P hosts from those of hosts running traditional non-P2P applications, the combination of the last four parameters outperforms the full set of six. The self-similarity of host TCP flows is analyzed on both the time scale and the behavior scale. On the behavior scale, we conclude that the payload of the packets received by a P2P host changes very little after the host has received a certain number of packets.
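The six host-level connection parameters listed in contribution 4 can be counted as in the following minimal Python sketch. The `Segment` record, its field names, and the function name are hypothetical. The parameter order follows the abstract, so the last four elements of the returned tuple are the subset reported as more effective.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Segment:
    """Hypothetical TCP segment header record; field names are illustrative."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    syn: bool
    ack: bool


def host_connection_features(host_ip: str, segments: Iterable[Segment]) -> Tuple[int, ...]:
    """Count the six SYN and SYN/ACK parameters for one monitored host."""
    syn_sent = 0
    synack_sent = 0
    syn_dst_ips, syn_dst_ports = set(), set()
    synack_src_ips, synack_src_ports = set(), set()
    for s in segments:
        if not s.syn:
            continue  # only connection-setup segments are counted
        if s.src_ip == host_ip:
            if s.ack:
                synack_sent += 1  # host answering an incoming connection
            else:
                syn_sent += 1  # host initiating a connection
                syn_dst_ips.add(s.dst_ip)
                syn_dst_ports.add(s.dst_port)
        elif s.dst_ip == host_ip and s.ack:
            # SYN/ACK received by the host: a peer accepted its connection
            synack_src_ips.add(s.src_ip)
            synack_src_ports.add(s.src_port)
    return (syn_sent, synack_sent,
            len(syn_dst_ips), len(synack_src_ips),
            len(syn_dst_ports), len(synack_src_ports))
```

Slicing the result with `[2:]` selects the four distinct-address and distinct-port counts. How these counts are thresholded or fed to a classifier is not specified in the abstract and is left out of the sketch.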
