
因特网流量类不平衡特性与分类方法的研究

Studying Class Imbalance Characteristics and Classification Methods on Internet Traffic Flows

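【Author】 Liu Zhen (刘珍)

【Advisor】 Liu Qiong (刘琼)

【Author information】 South China University of Technology (华南理工大学), Computer Application Technology, 2013, PhD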

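【Abstract (translated from the Chinese original)】 Internet traffic classification is an important foundation for network management, quality-of-service assurance, network accounting, and network security. Traditional traffic classification methods struggle to keep pace with the rapid development of Internet applications, and classification based on machine learning shows good prospects. However, such methods usually take high overall classification accuracy as the optimization objective and have not yet accounted for the multi-class imbalance of Internet traffic data, so classification performance tends to favor the majority classes and neglect the minority classes. In Internet traffic, some minority-class applications involve command flows or real-time communication flows, whose classification performance bears on communication reliability or user experience. Other minority classes are heavyweight applications, whose classification performance bears on network planning or bandwidth allocation. At present, the class imbalance characteristics of Internet traffic and the corresponding classification methods lack systematic study. Working on Internet traffic datasets and a selected feature space, the thesis observes and analyzes the class distribution of network flow samples, and studies Internet traffic classification from three angles: data resampling, feature selection, and classification algorithms. The main contributions are as follows.

(1) Class imbalance characteristics of Internet traffic data. The thesis examines class imbalance in both its surface and its underlying aspects. Comparing the flow counts and byte counts of each class shows that traffic data often contain several majority classes and several minority classes, that the gap in flow counts between majority and minority classes is pronounced, that a minority class may account for a large share of the bytes, and that a marked imbalance between large and small flows may exist within a class. Observing the distribution of flow samples in the selected feature space shows that samples of the same class are often spread over several sub-concept regions, that some sub-concepts contain only a few flow samples, and that samples of different classes frequently overlap. Studying how these characteristics affect classification shows that the multiple-sub-concept property affects traffic classification performance more strongly than between-class flow-count imbalance or between-class overlap.

(2) A cost-sensitive learning algorithm suited to the multiple minority classes of Internet traffic. When cost-sensitive learning is used to handle class imbalance in traffic data, a misclassification cost matrix based on flow ratios does not suit the difficult minority classes of Internet traffic data (minority classes whose training flow samples are not the fewest, but whose traffic is hard to classify correctly). The thesis controls the misclassification cost matrix through weighting: it analyzes the relationship between the room for increasing misclassification cost and the degree of class imbalance, and proposes an evaluation metric for the degree of class imbalance together with a weight calculation method, so as to raise the misclassification cost of difficult minority classes moderately while largely preserving the classification performance of the majority classes.

(3) A resampling method for Internet traffic data. To address the between-class flow-count imbalance, between-class overlap, multiple sub-concepts, and small disjuncts that may be present in Internet traffic data, the thesis proposes a hierarchical data resampling method, PSC (partition, sampling and combining). The original traffic dataset is first partitioned into several disjoint, dense subsets to reduce the number of sub-concepts within each class. Within each subset, minority-class flow samples are expanded by random interpolation over their feature values, which handles small disjuncts. Also within each subset, majority-class flow samples lying in the region where majority and minority classes overlap are removed, which alleviates between-class overlap. PSC thereby builds, for training the sub-classifiers, training subsets with low within-class dispersion, low between-class overlap, and a low degree of class imbalance.

(4) A selection algorithm for statistical features of Internet traffic. To address the multiple sub-concepts within classes, the between-class overlap, and the multiple minority classes that may be present in Internet traffic data, the thesis proposes a balanced feature selection algorithm, BFS (balanced feature selection). To select features under which the flow samples of a single class have low dispersion, it proposes a local correlation metric that evaluates how much certainty a single feature provides on the flow samples of a single class. To select features under which flow samples of different classes overlap little, it uses a global correlation metric that evaluates how much certainty a feature provides about the class variable. Based on the local and global correlation of each feature, it selects for each class the features that are locally correlated and globally discriminative, so that the selected feature subset helps distinguish the multiple minority classes.

(5) Classification methods for heavy Internet flows. In Internet traffic, the within-class imbalance between large and small flows may cause the classifier to neglect learning the large flows, and the between-class flow-count imbalance may cause the classifier to neglect the classification performance of minority classes that carry a high byte count. Both situations can make heavy flows hard to classify and yield low byte-level classification performance. For the imbalance between large and small flows, the thesis proposes a flow size modularization method based on information gain ratio (FSMGR). FSMGR searches for the threshold separating large from small flows with the objective of minimizing the data complexity of the large-flow set, splits the original traffic dataset into a large-flow subset and a small-flow subset, and trains a classifier on each, which strengthens the learning of large flows. For the between-class flow-count imbalance, the thesis improves the PSC resampling method proposed in (3) so that it alleviates the imbalance between minority and majority classes while retaining the heavy flows, and combines it with the Boosting ensemble learning algorithm to improve the stability of the classifier.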
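The PSC steps named in contribution (3) can be illustrated with a small sketch. It is not the thesis's implementation: the abstract does not say how subsets are formed, how overlap is detected, or how the per-subset results are combined. The sketch therefore assumes plain k-means for the partition step, treats a majority flow as "overlapping" when its nearest neighbour is a minority flow, and stops at returning one training subset per partition. All function names and the toy class labels `WWW` and `SSH` are hypothetical.

```python
import random
from collections import Counter


def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def partition(samples, k, iters=10):
    """Partition step. Plain k-means is an assumption standing in for the
    thesis's own way of building disjoint, dense subsets."""
    k = min(k, len(samples))
    centers = [feats for feats, _ in samples[:k]]  # naive init: first k vectors
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for feats, label in samples:
            nearest = min(range(k), key=lambda c: dist2(feats, centers[c]))
            groups[nearest].append((feats, label))
        for c, members in enumerate(groups):
            if members:  # keep the old center if a group went empty
                dim = len(members[0][0])
                centers[c] = tuple(
                    sum(f[d] for f, _ in members) / len(members) for d in range(dim)
                )
    return [g for g in groups if g]


def oversample(subset, minority, target, rng):
    """Sampling step, part 1: add synthetic flows of one minority class by
    random interpolation between two of its flows in the same subset."""
    pool = [feats for feats, label in subset if label == minority]
    synthetic = []
    while pool and len(pool) + len(synthetic) < target:
        a, b = rng.choice(pool), rng.choice(pool)
        t = rng.random()
        synthetic.append((tuple(x + t * (y - x) for x, y in zip(a, b)), minority))
    return subset + synthetic


def remove_overlap(subset, minority_classes):
    """Sampling step, part 2 (assumed rule): drop a majority flow when its
    nearest neighbour in the subset belongs to a minority class."""
    kept = []
    for i, (feats, label) in enumerate(subset):
        if label not in minority_classes:
            others = [s for j, s in enumerate(subset) if j != i]
            if others:
                nearest = min(others, key=lambda s: dist2(feats, s[0]))
                if nearest[1] in minority_classes:
                    continue
        kept.append((feats, label))
    return kept


def psc(samples, minority_classes, k=2, seed=0):
    """Return one resampled training subset per partition."""
    rng = random.Random(seed)
    result = []
    for subset in partition(samples, k):
        target = max(Counter(label for _, label in subset).values())
        for cls in minority_classes:
            subset = oversample(subset, cls, target, rng)
        result.append(remove_overlap(subset, minority_classes))
    return result


# Toy data: (feature vector, class label); "SSH" plays the minority class.
flows = [((0.0, 0.0), "WWW"), ((0.1, 0.0), "WWW"), ((0.0, 0.1), "WWW"),
         ((0.2, 0.2), "SSH"), ((5.0, 5.0), "WWW"), ((5.1, 5.0), "WWW"),
         ((5.0, 5.2), "SSH"), ((5.2, 5.1), "SSH")]
for subset in psc(flows, {"SSH"}):
    print(Counter(label for _, label in subset))
```

Each returned subset would train one sub-classifier, as the abstract describes. Oversampling before overlap removal follows the order in which the abstract lists the two operations.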

【Abstract】 Internet traffic classification is an important foundation for network management, quality-of-service guarantees, network accounting, and network security. Traditional traffic classification methods have difficulty accommodating the rapid development of network applications, and Internet traffic classification using machine learning (ML) is a promising alternative. However, traffic classifiers are usually optimized for high overall classification accuracy, which does not take into account the class imbalance of Internet traffic datasets. Classification performance therefore tends to be biased toward the majority classes and to ignore the minority classes. In Internet traffic, some minority classes contain signaling flows or real-time communication flows, and their classification performance influences communication quality and user experience. Other minority classes carry a large number of bytes, and their classification performance affects network planning or the allocation of bandwidth resources.

At present, systematic research on class imbalance characteristics and classification methods in Internet traffic classification is lacking. This thesis observes the class distribution of Internet traffic datasets in a selected feature space, analyzes the imbalance characteristics, and then studies Internet traffic classification methods from the perspectives of data resampling, feature selection, and classification algorithms. The main contributions are as follows.

(1) Class imbalance characteristics of Internet traffic datasets. This thesis studies the class imbalance characteristics of Internet traffic datasets from external and internal aspects. Comparing the flow count and byte count of each traffic class shows that traffic datasets usually contain multiple majority classes and multiple minority classes, that there is a large gap between the flow counts of majority and minority classes, that a minority class may carry a large number of bytes, and that some classes show an obvious imbalance between large flows and small flows. The distribution of flow samples in the feature space shows that the flow samples of a single class usually form several sub-concepts, that some sub-concepts contain only a small number of flow samples, and that the flow samples of one class overlap those of other classes. Studying the influence of these characteristics on Internet traffic classification performance shows that the presence of multiple sub-concepts correlates more closely with classification performance than flow-count imbalance or class overlap does.

(2) Cost-sensitive learning for traffic datasets with multiple minority classes. When a cost-sensitive learning algorithm is applied to classify traffic flows, a cost matrix based on flow ratios does not fit the difficult minority classes, which are not the classes with the fewest flows but are nonetheless hard to identify. This thesis uses weights to improve the cost matrix. By analyzing the relationship between the degree of class imbalance and the room for increasing misclassification cost, it proposes an evaluation metric for the degree of class imbalance and a method for calculating the weights. The method aims to increase the misclassification cost of difficult classes appropriately without significantly decreasing the classification performance of the majority classes.

(3) Data resampling method for Internet traffic datasets. A traffic dataset may exhibit several imbalance-related factors, namely flow-count imbalance, class overlap, multiple sub-concepts, and small disjuncts. To handle these problems simultaneously, a hierarchical data resampling method named PSC (partition, sampling and combining) is proposed. First, the original traffic dataset is partitioned into multiple disjoint, dense subsets to reduce the number of sub-concepts. Oversampling is then performed on each subset, which handles small disjuncts by adding flow samples for the minority classes. Next, a heuristic undersampling method, with rules devised for removing majority-class flow samples, is applied within each subset to alleviate class overlap. PSC thus builds training subsets with lower within-class dispersion, class overlap, and class imbalance.

(4) Selection algorithm for Internet traffic flow features. Considering the multiple sub-concepts, class overlap, and multiple minority classes, a balanced feature selection (BFS) algorithm is proposed. To select features under which the flow samples of a class have lower dispersion, a local correlation metric is proposed that evaluates the certainty a feature provides on the flow samples of a class. To select features under which the flow samples of different classes overlap less, a global correlation metric is applied that evaluates the certainty of the class variable when a feature is given. Based on the local and global correlation of each feature, a search algorithm is proposed that selects, for each class, features that are locally correlated and also have high global discriminative power. As a result, the selected feature subset includes features that help discriminate the minority classes.

(5) Classification methods for large flows. An imbalance between large flows and small flows exists in some classes, which may cause the classifier to neglect the learning of large flows. The flow-count imbalance between minority and majority classes may cause the classifier to neglect the classification performance of minority classes that carry a large number of bytes. Both cases can make large flows hard to classify and lead to low byte accuracy. To handle the imbalance between small flows and large flows, a flow size modularization method based on information gain ratio (FSMGR) is proposed. With the objective of minimizing the data complexity of the large flows, it searches for a partition threshold (related to byte count). The original traffic training set is partitioned into large-flow and small-flow subsets according to this threshold, and each subset is used to train its own classifier, so that the large flows are emphasized and the classification problem becomes easier. To handle the imbalance between minority and majority classes, the PSC method in (3) is improved (and named BPSC) to alleviate the flow-count imbalance while retaining all large flows, and the boosting ensemble learning algorithm is used to improve the stability of the classifier.
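A small worked example may help show why contribution (5) treats byte accuracy separately from flow-level accuracy. The sketch below assumes the usual definitions: flow accuracy is the fraction of flows classified correctly, and byte accuracy is the fraction of bytes carried by correctly classified flows. The abstract does not define these terms itself. The function name, the class labels, and the flow sizes are hypothetical.

```python
def flow_and_byte_accuracy(flows):
    """flows: iterable of (true_class, predicted_class, byte_count) tuples."""
    flows = list(flows)
    if not flows:
        raise ValueError("at least one flow is required")
    correct = [f for f in flows if f[0] == f[1]]
    flow_acc = len(correct) / len(flows)
    byte_acc = sum(f[2] for f in correct) / sum(f[2] for f in flows)
    return flow_acc, byte_acc


# Hypothetical trace: 98 small flows classified correctly, and 2 large flows
# of a minority class misclassified as the majority class.
trace = [("WWW", "WWW", 1_000)] * 98 + [("FTP", "WWW", 1_000_000)] * 2
flow_acc, byte_acc = flow_and_byte_accuracy(trace)
print(f"flow accuracy = {flow_acc:.3f}, byte accuracy = {byte_acc:.3f}")
```

In this toy trace the classifier is right on 98% of the flows but on under 5% of the bytes, because the two misclassified flows carry almost all of the traffic volume.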
