基于流聚类的网络业务识别关键技术研究

Research on the Key Technologies of Data Stream Clustering Based Network Service Identification

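【Author】 Li Dan (李丹)

【Advisor】 Hu Zhengming (胡正名)

【Author information】 Beijing University of Posts and Telecommunications (北京邮电大学), Information Security, 2013, PhD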

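【Abstract, translated from the Chinese】 With the rapid development of the Internet, network services and application types have proliferated. This has raised social efficiency and enriched people's cultural life, but it has also made the network environment more complex: large volumes of P2P traffic consume bandwidth, causing congestion, degrading operators' quality of service, and making security problems increasingly prominent. Network management and monitoring are therefore urgently needed to optimize network resources, address security problems, improve transmission capacity, and provide a sound basis for network planning and capacity expansion. Network service identification is the foundation of, and an effective means for, such management and monitoring. Traditional identification techniques, which rely heavily on port numbers and packet payloads, can no longer cope with today's complex network environment. Data-mining-based identification extracts statistical information from network service flows and classifies or clusters them; it is better suited to identifying today's complex traffic and has become one of the key research directions in network flow identification.

Considering the data stream nature of network service flows, this thesis studies data stream clustering algorithms and network service identification schemes. The main contents and contributions are as follows.

Arbitrary-shape data stream clustering with an adaptive grid time-weight threshold: Grid techniques are fast, and their processing time depends only on the granularity of the grid partition. Because network service flows are distributed in arbitrary shapes in the data space and are skewed in both time and space, this thesis proposes a grid-based clustering algorithm for arbitrary-shape data streams. Based on a fading function, the method introduces the concepts of potentially dense grids and outlier grids and defines an adaptive grid time-weight threshold that reflects both the temporal skew and the spatial skew of network service flows. An online maintenance algorithm periodically checks and updates the two kinds of grids and deletes degenerated grids, which improves storage and time efficiency during clustering. Experiments show that the algorithm reliably identifies arbitrary-shape, spatially skewed clusters in noisy data, and that it achieves good clustering quality and speed on network service flow data.

Grid-density-based evolutionary clustering analysis of data streams: When analyzing network service flows, operators often want to know not only the traffic characteristics at a given moment but also how those characteristics change within a time period or between two periods. This thesis proposes a grid-density-based data stream clustering algorithm. It uses a data point density coefficient to handle the temporal skew of network service flow data, defines a grid feature vector centered on grid density to reduce memory usage, and uses the pyramidal time frame technique to store snapshots of the online-maintained grid set according to fixed rules. This supports clustering of the current data, clustering of the data within the current time horizon, and analysis of how the data stream evolves over a given period. Experiments show that the algorithm is robust to noise, produces arbitrary-shape final clusters in response to different user requests, analyzes data stream evolution well, and achieves good clustering quality and processing speed on network service flows.

Semi-supervised multi-level network service identification scheme based on stream clustering: The imbalance between long and short flows in network traffic, and their different characteristics, mean that no single identification method can cover all network service traffic. This thesis applies different long/short flow criteria to flows carried over TCP and over UDP, combines several identification techniques, and proposes an online multi-level system that splits flows before identifying them: short flows are identified in multiple stages using port-based, payload-based, and data-mining-based methods, while long flows are identified with data-mining-based methods. An analysis of traditional data-mining-based identification shows that methods based on traditional classification are limited by the training data used to learn the classifier and are unsuitable for traffic that changes in real time. Methods based on traditional clustering can discover the natural clusters in the data, but their need to scan the data set multiple times is likewise unsuitable for dynamic network service flows, and interpreting the resulting clusters is itself a research difficulty.

Taking the characteristics of network service flows fully into account, this thesis proposes a semi-supervised network service identification scheme based on stream clustering. The scheme uses a two-layer processing framework that handles online real-time network service flows in a single scan, stores the resulting micro-clusters in an offline time-snapshot database, and maintains them according to fixed rules. Offline macro-clustering selects the clustering algorithm and data according to user requests and produces the final clusters. The thesis also proposes building a mapping-rule database from the real-time data stream and updating and maintaining it periodically: sampled flows are identified by other identification techniques, and mapping pairs between the corresponding micro-clusters and network application types are established to help identify the application type of each cluster. In addition, the concept of sub-flows is introduced for long flows: sub-flow attribute features are extracted, and the best feature subset is selected for use in the identification scheme.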

【Abstract】 With the development of the Internet, the number of network applications has increased rapidly. This has improved social efficiency and enriched people's cultural life, but it has also complicated the network environment. Vast amounts of P2P traffic occupy bandwidth and cause congestion, quality of service declines, and network security has become a serious problem. There is therefore an urgent need for network management and monitoring that can optimize network resources, solve security problems, improve transmission capacity, and provide a sound basis for network expansion. Network service traffic identification is one of the effective means of addressing these problems. However, traditional identification techniques rely excessively on port numbers and packet payloads, which limits their ability to deal with complex network traffic. Data-mining-based identification extracts statistical information from network service traffic and classifies it with supervised or unsupervised methods. It is better suited to identifying complex network traffic and has become one of the key research directions.

Considering the data stream characteristics of network service flows, this thesis concentrates on data stream clustering algorithms and network service traffic identification schemes. The main contents and contributions are as follows.

Clustering of arbitrary-shape data streams based on an adaptive grid time-weight threshold: Grid techniques offer high processing speed, with a processing time that depends only on the granularity of the grid. Given that network data streams have arbitrary shapes and are skewed in both time and space, the thesis proposes a grid-based clustering algorithm for arbitrary-shape data streams. Based on a fading function, the algorithm introduces the concepts of potentially dense grids and outlier grids and defines an adaptive grid time-weight threshold that accounts for both the temporal and the spatial skew of network service data streams. An online maintenance procedure periodically detects and deletes ineligible grids, which improves storage and time efficiency. Experiments show that the algorithm can identify arbitrary-shape, spatially skewed clusters in noisy data and clusters network data streams with good quality and speed.

Evolutionary clustering of data streams based on grid density: Users may want to know not only the characteristics of network data streams at a specific time, but also their characteristics over a specific time horizon and how network traffic evolves between different periods. The thesis proposes a grid-density-based clustering algorithm for evolving data streams. A density coefficient for each data record is applied to handle the temporal skew of network traffic, and the pyramidal time frame technique is used to save snapshots of the grid set at specific times. The algorithm supports clustering at a specific time, clustering over a time horizon, and evolution analysis. Experiments show that the algorithm is robust to noise and performs well in data stream evolution analysis and processing speed.

Semi-supervised network service identification scheme based on data stream clustering: A single identification technique cannot analyze network service traffic comprehensively, because mice flows and elephant flows occur in unbalanced proportions and have different properties. The thesis applies different elephant-flow thresholds to TCP flows and UDP flows and proposes a multi-level network traffic recognition system that combines several identification techniques. In this system, mice flows are identified step by step using port-based, payload-based, and data-mining-based methods, while elephant flows are identified using data-mining-based methods only. Among data-mining-based approaches, traditional supervised methods are limited by the training data set used to learn the classifier and are not suitable for real-time network traffic identification. Unsupervised methods can find the natural clusters in traffic, but efficiently mapping clusters to service applications remains difficult.

Taking the features of network traffic fully into account, the thesis presents a semi-supervised network service traffic identification scheme based on data stream clustering. The scheme applies a two-phase framework that processes online real-time network traffic in a single pass and periodically stores the set of micro-clusters in an offline time-snapshot database. In response to user requests, the offline component chooses a clustering algorithm and the related data from the time-snapshot database and generates clusters. The scheme also maintains an offline mapping-rule database, obtained by identifying sampled real-time traffic flows with port-based or payload-based techniques and mapping the related micro-clusters to application types. In addition, the thesis uses different elephant-flow thresholds to extract sub-flows from TCP and UDP elephant flows. Features of the sub-flows are extracted, and the best feature subset is chosen by a feature selection algorithm.
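
The abstract does not give the fading function or the form of the adaptive threshold, so the following is only a minimal sketch of the general idea of grid densities that decay over time. It assumes an exponential fading function 2^(-lam * dt) and a fixed pruning threshold in place of the thesis's adaptive one; the class `DecayedGrid` and all of its parameter names are hypothetical and chosen for illustration.

```python
import math


class DecayedGrid:
    """Grid cells whose density fades exponentially with elapsed time.

    Assumption: the fading function is 2 ** (-lam * dt). The thesis may use
    a different function and an adaptive (not fixed) threshold.
    """

    def __init__(self, cell_size, lam):
        self.cell_size = cell_size  # edge length of each grid cell
        self.lam = lam              # decay rate (hypothetical parameter)
        self.cells = {}             # cell index -> (density, last update time)

    def cell_of(self, point):
        """Map a point to the integer index of the cell that contains it."""
        return tuple(math.floor(x / self.cell_size) for x in point)

    def density(self, cell, now):
        """Density of a cell at time `now`, with fading applied lazily."""
        if cell not in self.cells:
            return 0.0
        value, last = self.cells[cell]
        return value * 2.0 ** (-self.lam * (now - last))

    def insert(self, point, now):
        """Add one record: fade the old density, then add a weight of 1."""
        cell = self.cell_of(point)
        self.cells[cell] = (self.density(cell, now) + 1.0, now)
        return cell

    def prune(self, now, threshold):
        """Delete cells whose faded density is below `threshold`."""
        stale = [c for c in self.cells if self.density(c, now) < threshold]
        for c in stale:
            del self.cells[c]
        return stale


grid = DecayedGrid(cell_size=1.0, lam=1.0)
grid.insert((0.5, 0.5), now=0)   # lands in cell (0, 0)
grid.insert((0.2, 0.9), now=1)   # same cell, one time unit later
grid.insert((3.2, 4.7), now=0)   # isolated record in cell (3, 4)
removed = grid.prune(now=2, threshold=0.5)
```

In this sketch each cell stores only a density and a timestamp, so fading is applied lazily when a cell is read or updated instead of on every tick. A periodic call to `prune` plays the role of the online maintenance step that deletes degenerated grids.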
