
Research on Data Stream Ensemble Classifiers

【Author】 Yang Xianfei

【Advisor】 Zhang Jianpei

【Author information】 Harbin Engineering University, Computer Application Technology, 2011, Doctoral dissertation

【Abstract】 With the continuous development and application of information technology, large volumes of high-speed, dynamic, continuously arriving information are collected every day, such as sensor network data, telephone records, financial data, and commercial transaction data. Traditional static data sets can no longer express this kind of information effectively, so the data stream was proposed as a new data type and is now widely used in these fields. A data stream is an ordered sequence of continuously arriving, potentially unbounded data. Compared with a traditional static data set, it has the following characteristics: (1) data arrive at high speed; (2) the volume of data is very large; (3) the data are ordered; (4) the stream changes dynamically; (5) the data are often high-dimensional. These characteristics prevent traditional data mining classification algorithms from handling data streams effectively, so research on data stream mining algorithms has become one of the hot topics in data mining. This dissertation focuses on ensemble classification of data streams. Centered on the two aspects of training individual classifiers and fusing their outputs, it studies ensemble classification of noisy data streams, high-speed data streams, and data streams with incomplete class labels. The main contributions are as follows.

First, because the classification accuracy of an ensemble classifier trained on a noisy data stream is seriously degraded by the noise, a cross-validation noise-tolerant ensemble classifier algorithm for data streams is proposed. Cross-validation noise-tolerant classification is a typical noise elimination technique: it removes noisy instances from a data set before the classification model is built, markedly improving the model's accuracy. Because its effectiveness had not previously been proved theoretically, this dissertation gives a rigorous derivation based on the sample complexity theory of noisy data sets, and from the result proposes a new cross-validation noise-tolerant classification algorithm that, applied in the data stream setting, further improves the ensemble's ability to classify noisy streams.

Second, because data in a high-speed stream arrive far faster than the processor can handle, so that not all data can be used to train the individual classifiers, an ensemble classifier algorithm for high-speed data streams based on biased sampling is proposed. Sampling effectively reduces the amount of data to be processed and shortens the time needed to train and update the ensemble, but ensembles built from training sets produced by different sampling strategies differ markedly in accuracy. This dissertation therefore uses the bias-variance decomposition of the ensemble's expected error to compute each candidate instance's contribution to that expected error, and shows through a geometric analysis of ensemble classification performance that sampling the instances with the largest expected-error contributions as training data for updating the model effectively improves accuracy; the biased-sampling ensemble classifier algorithm is built on this result.

Third, because it is difficult to obtain class labels for all data in a stream, a semi-supervised ensemble classifier algorithm for data streams based on the cluster assumption is proposed. Traditional semi-supervised classification can handle data sets with incomplete labels, but how to bring it into the data stream setting and exploit stream characteristics to improve its accuracy remained an open problem. Through an analysis of the classification error of semi-supervised algorithms based on the cluster assumption, this dissertation shows that enlarging the labeled data set used to train each individual classifier effectively reduces the classification error, and uses this conclusion to construct the proposed algorithm.

Finally, because the combination of individual classifiers chosen by a selective ensemble algorithm is fixed once training ends and cannot be adjusted dynamically for specific instances, a two-phase selective ensemble classifier algorithm for data streams is proposed. The dissertation first shows by analysis that although the set of individual classifiers chosen by a selective ensemble algorithm performs best on the data set as a whole, it is not necessarily the optimal combination for classifying a particular instance. Using the support vector data description (SVDD) algorithm to select the set of individual classifiers dynamically and adaptively at classification time avoids this situation and improves the classification performance of the selective ensemble.
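The two-phase selection idea described above — attach a data description to each individual classifier and, at prediction time, let only the members whose description covers the incoming instance vote — can be sketched as follows. This is a minimal illustration, not the dissertation's algorithm: a crude enclosing hypersphere (centroid plus maximum radius) stands in for SVDD, and `SphereDescriptor`, `NearestCentroid`, and `two_phase_predict` are all hypothetical names.

```python
import numpy as np

class SphereDescriptor:
    """Crude stand-in for SVDD: describes a member's training data by
    its centroid and the maximum distance from the centroid."""
    def fit(self, X):
        self.center = X.mean(axis=0)
        self.radius = np.linalg.norm(X - self.center, axis=1).max()
        return self
    def contains(self, x):
        return np.linalg.norm(x - self.center) <= self.radius

class NearestCentroid:
    """Tiny individual classifier: predicts the class whose centroid is nearest."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.centroids = np.array([X[y == c].mean(axis=0) for c in self.classes])
        return self
    def predict_one(self, x):
        d = np.linalg.norm(self.centroids - x, axis=1)
        return self.classes[int(np.argmin(d))]

def two_phase_predict(x, members):
    """Phase 2: keep only the members whose data description covers x,
    then majority-vote among them (falling back to all members)."""
    active = [(clf, desc) for clf, desc in members if desc.contains(x)]
    if not active:
        active = members
    votes = [clf.predict_one(x) for clf, desc in active]
    return max(set(votes), key=votes.count)
```

In this sketch a member trained on one region of the stream simply abstains on instances far outside that region, which is the effect the dynamic, instance-specific selection is after.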

【Abstract】 With the development and application of information technology, people can collect large amounts of high-speed, dynamic, continuously arriving information, such as sensor network data, telephone records, financial data, and commercial transaction data. Traditional static data sets, as carriers of information, can no longer express such information effectively. Data streams were therefore proposed as a new data type and are widely used in the fields above. A data stream is an ordered sequence of data that arrives continuously and is potentially infinite. Compared with a traditional static data set, a data stream has the following features: (1) data arrive at high speed; (2) the data are large-scale; (3) the data are ordered; (4) the stream changes dynamically; (5) the data are often high-dimensional. These features mean that data streams cannot be handled effectively by traditional data mining classification algorithms, so research on data stream mining algorithms has become one of the hot spots in the data mining area. This dissertation focuses on classifying data streams with ensemble classifiers. From the two aspects of training individual classifiers and fusing their outputs, it studies noisy data streams, high-speed data streams, and data streams with incomplete class labels. The main work is as follows. First, because the classification accuracy of an ensemble classifier trained on a noisy data stream is seriously affected by the noise, a cross-validation noise-tolerant ensemble classifier algorithm for data streams is proposed. Cross-validation noise-tolerant classification is an important method for eliminating noise from a data set: it removes noisy instances from the training set before the classifier is built, so classification accuracy increases significantly. However, its validity had not previously been proved in theory.
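A cross-validation noise filter of the general kind described above might look like the following sketch: each fold is predicted by a model trained on the remaining folds, and instances whose held-out prediction disagrees with their recorded label are treated as noise and dropped. The nearest-centroid base learner and the names `nearest_centroid_predict` and `cv_noise_filter` are illustrative assumptions, not the dissertation's actual algorithm.

```python
import numpy as np

def nearest_centroid_predict(Xtr, ytr, Xte):
    """Toy base learner: assign each test point the class with the nearest centroid."""
    classes = np.unique(ytr)
    cents = np.array([Xtr[ytr == c].mean(axis=0) for c in classes])
    d = np.linalg.norm(Xte[:, None, :] - cents[None, :, :], axis=2)
    return classes[np.argmin(d, axis=1)]

def cv_noise_filter(X, y, n_folds=5, seed=0):
    """Cross-validation noise filter: predict each fold with a model trained
    on the other folds; keep only instances whose held-out prediction
    agrees with their label."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    keep = np.ones(len(X), dtype=bool)
    for f in folds:
        mask = np.ones(len(X), dtype=bool)
        mask[f] = False                       # hold this fold out
        pred = nearest_centroid_predict(X[mask], y[mask], X[f])
        keep[f] = pred == y[f]                # flag disagreements as noise
    return X[keep], y[keep]
```

On a well-separated two-class set with one flipped label, the filter removes exactly the mislabeled instance before any classifier is trained on the cleaned data.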
Based on the sample complexity theory of noisy data sets, this dissertation proves the validity of the algorithm, and from the result of the proof a new cross-validation noise-tolerant classification algorithm for data streams is proposed, which further increases classification accuracy on noisy data streams. Second, in a high-speed data stream the data rate can exceed the ensemble's computational power, so the ensemble cannot be updated with all of the arriving data; an ensemble classifier algorithm based on biased sampling is therefore proposed. Sampling effectively reduces the data scale and so decreases the time needed to train and update the ensemble, but ensembles trained on data sets produced by different sampling strategies differ markedly in classification performance. Therefore, using the bias-variance decomposition of the ensemble's expected error, the expected-error contribution of each candidate instance is computed, and a geometric analysis of ensemble classification performance shows that training the ensemble on the instances with the largest expected-error contributions yields higher classification accuracy. On this basis, an ensemble classifier algorithm based on biased sampling is proposed. Third, because it is hard to label all data in a stream, a semi-supervised ensemble classifier algorithm for data streams based on the cluster assumption is proposed. Traditional semi-supervised classification can solve the classification problem for incompletely labeled data sets, but how to apply it in the data stream environment and how to exploit stream characteristics to improve its accuracy remained unsolved problems.
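The biased-sampling idea just described — prefer instances that contribute most to the ensemble's expected error — can be illustrated roughly as follows. `error_contribution` and `biased_sample` are hypothetical helpers, and the misclassification rate across ensemble members is only a crude proxy for the contribution degree that the dissertation derives via bias-variance decomposition.

```python
import numpy as np

def error_contribution(votes, y):
    """Fraction of ensemble members that misclassify each instance --
    a rough proxy for its contribution to the expected error."""
    votes = np.asarray(votes)          # shape (n_members, n_instances)
    return (votes != y).mean(axis=0)

def biased_sample(X, y, votes, k, seed=0):
    """Draw k training instances, biased toward those with a larger
    estimated expected-error contribution."""
    rng = np.random.default_rng(seed)
    w = error_contribution(votes, y) + 1e-6   # keep every weight positive
    p = w / w.sum()
    idx = rng.choice(len(X), size=k, replace=False, p=p)
    return X[idx], y[idx]
```

The sampled subset, rather than the full chunk, is then used to train or update the ensemble, which is what makes the approach viable when data arrive faster than they can be processed.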
By analyzing the generalization error of semi-supervised classifiers based on the cluster assumption, it is shown that enlarging the labeled data set at training time improves semi-supervised classification accuracy; using this conclusion, a semi-supervised ensemble classifier algorithm for data streams based on the cluster assumption is proposed. Finally, because the combination of individual classifiers chosen by a selective ensemble algorithm is fixed once training ends and cannot be adjusted dynamically for specific instances, a two-phase selective ensemble classifier algorithm for data streams is presented. The analysis shows that although the individual classifiers selected by a selective ensemble algorithm achieve the best classification performance on the data set as a whole, they are not necessarily the optimal combination for classifying a specific instance. Hence, dynamically and adaptively choosing the individual classifiers with the support vector data description algorithm avoids this situation and improves the classification performance of the selective ensemble.
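The cluster-assumption step — enlarging the labeled set by letting each unlabeled point inherit the label of its cluster before the individual classifiers are trained — can be sketched in a toy form. The sketch assumes a k-means-style clustering seeded at the labeled class centroids; `cluster_then_label` is a hypothetical name, not the dissertation's algorithm.

```python
import numpy as np

def cluster_then_label(X_lab, y_lab, X_unl, n_iter=10):
    """Cluster-assumption labeling sketch: run k-means (one cluster per
    class, seeded at the labeled class centroids) over all data, then
    give each unlabeled point its cluster's label, enlarging the labeled
    set used to train each individual classifier."""
    classes = np.unique(y_lab)
    centers = np.array([X_lab[y_lab == c].mean(axis=0) for c in classes])
    X_all = np.vstack([X_lab, X_unl])
    for _ in range(n_iter):
        d = np.linalg.norm(X_all[:, None, :] - centers[None, :, :], axis=2)
        assign = np.argmin(d, axis=1)
        for j in range(len(classes)):
            if np.any(assign == j):
                centers[j] = X_all[assign == j].mean(axis=0)
    d = np.linalg.norm(X_unl[:, None, :] - centers[None, :, :], axis=2)
    y_unl = classes[np.argmin(d, axis=1)]
    return np.vstack([X_lab, X_unl]), np.concatenate([y_lab, y_unl])
```

Under the cluster assumption, points in the same cluster share a label, so the enlarged labeled set is consistent with the true decision boundary, which is what permits the error reduction argued in the analysis above.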
