Research on Text Mining Techniques and Their Application in the Integrated Risk Information Network

【Author】 Zhang Xiang (张翔)

【Advisor】 Zhou Mingquan (周明全)

【Author information】 Northwest University, Computer Software and Theory, 2011, PhD

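【Abstract (translated from the Chinese)】 As electronic text grows explosively, finding useful knowledge in massive text data has become an important topic in data mining. This thesis takes as its research background "Key Technology Research and Demonstration of Integrated Risk Guardians (IRG)" (2006BAD20B02), a key project of the National Science and Technology Support Program under the 11th Five-Year Plan. Targeting the intelligent acquisition and classification of integrated risk information, and taking into account the characteristics of risk and disaster information on the Internet, it studies four key text mining techniques: representation models, feature selection, text classification, and text association. The main results are:
(1) A representation model for integrated risk information is proposed. The thesis analyzes a shortcoming of the tf*idf weighting used in the vector space model, namely that it ignores how features are distributed across classes. Because integrated risk information is Web information, a feature weighting method is designed that jointly considers term frequency, inverse document frequency, the category weight of the feature term, and HTML tags. Experiments show that it improves the classification performance on risk information.
(2) A feature selection method based on ReliefF combined with the RMI evaluation function is proposed. Traditional feature selection methods in text mining ignore the correlation between feature terms, which leaves many redundant features in the selected subset. To address this, a combined method is designed: the ReliefF algorithm first removes irrelevant features, and the RMI evaluation function then filters out redundant ones. Experiments show that, compared with traditional feature selection methods, it removes redundancy among text features effectively.
(3) A confidence-based Attribute Bagging text classification algorithm is proposed. To address the unreasonable assumption in Bagging that all weak classifiers carry the same weight, an improved Bagging algorithm is designed. It re-samples the attributes of the training samples to obtain multiple training sets, uses kNN as the weak classifier, computes the confidence of each weak classifier to obtain its voting weight, and produces the ensemble result according to the voting rule. Experiments show that a text classifier built with this algorithm classifies better than one built with the Attribute Bagging algorithm.
(4) A key-phrase extraction method based on grey relational analysis is proposed. Key-phrases are extracted by computing the grey relational degree between the given key-phrases of the integrated risk information and the feature terms. Its main advantage is that it overcomes the "small sample" problem: it applies regardless of how many samples are available and whether or not they follow a regular pattern. It also solves the problem that key-phrase extraction methods based on mathematical statistics ignore the contribution of low-frequency domain-specific terms.
(5) The results of this research are applied to the Integrated Risk Information Network. Combined with focused crawler technology, the intelligent acquisition and classification of integrated risk information on the Internet is designed and implemented, with good results.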
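
The third contribution (confidence-weighted Attribute Bagging with kNN weak classifiers) can be illustrated with a minimal Python sketch. The abstract does not say how confidence is computed, so this sketch assumes it is each member's accuracy on a held-out validation set. It also assumes dense numeric feature vectors and squared Euclidean distance. All function names (`knn_predict`, `fit_attribute_bagging`, `predict`) are hypothetical and are not taken from the thesis.

```python
import random
from collections import Counter, defaultdict


def knn_predict(train_X, train_y, x, features, k=3):
    """Classify x with plain kNN, using only the given feature indices."""
    def dist(a, b):
        # Squared Euclidean distance restricted to the chosen attributes.
        return sum((a[f] - b[f]) ** 2 for f in features)

    nearest = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def fit_attribute_bagging(train_X, train_y, val_X, val_y,
                          n_members=5, subset_size=2, k=3, seed=0):
    """Build the ensemble: one random attribute subset per kNN member.

    Assumption (not from the thesis): a member's confidence is its
    accuracy on the validation set (val_X, val_y).
    """
    rng = random.Random(seed)
    n_features = len(train_X[0])
    members = []
    for _ in range(n_members):
        features = rng.sample(range(n_features), subset_size)
        hits = sum(knn_predict(train_X, train_y, x, features, k) == y
                   for x, y in zip(val_X, val_y))
        members.append((features, hits / len(val_X)))
    return members


def predict(members, train_X, train_y, x, k=3):
    """Confidence-weighted vote: each member adds its confidence to its label."""
    votes = defaultdict(float)
    for features, confidence in members:
        votes[knn_predict(train_X, train_y, x, features, k)] += confidence
    return max(votes, key=votes.get)
```

Setting every confidence to 1.0 reduces `predict` to an unweighted majority vote over attribute subsets, which is the equal-weight behaviour the abstract argues against.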

【Abstract】 With the rapid development of Internet technology and the exponential growth of electronic text, finding useful knowledge in large amounts of data has become an important topic in data mining. This thesis is set against the background of "Key technology research and demonstration of Integrated Risk Guardians" (No. 2006BAD20B02), a key project of the National Science and Technology Support Program under the 11th Five-Year Plan. To achieve intelligent acquisition and classification of Integrated Risk Information, key text mining technologies are studied, namely representation models, feature selection, text classification and text association, taking into account the characteristics of Integrated Risk Information on the Internet. The main contributions are summarized as follows:
1. A representation model for Integrated Risk Information is proposed. The tf*idf weighting method of the vector space model is analyzed first, and its shortcoming is identified: it ignores how features are distributed among classes. Then, treating Integrated Risk Information as Web information, a weighting method is proposed that jointly considers term frequency, inverse document frequency, the category weight of feature terms and HTML tags. Experiments show that this method improves the performance of text categorization.
2. A text feature selection method based on the ReliefF algorithm and the RMI evaluation function is proposed. Traditional feature selection methods in text mining neglect the correlation between features, which leaves many redundant features in the selected subsets. To address this, a combined feature selection method is designed: irrelevant features are first removed by the ReliefF algorithm, and redundant features are then filtered out by the RMI evaluation function. Experiments show that this method removes redundant text features more effectively than the traditional methods.
3. A text classifier based on confidence-weighted Attribute Bagging is proposed. To address the problem that the weak classifiers in Bagging all carry the same weight, an improved Bagging algorithm is developed. The algorithm obtains multiple training sets by re-sampling the attributes of the training samples, uses kNN as the weak classifier, and computes a confidence for each weak classifier that serves as its voting weight. The ensemble classification result is then obtained by the voting rule. Experiments show that a text classifier built with this algorithm performs better than one built with the Attribute Bagging algorithm.
4. A key-phrase extraction method based on grey relational analysis is proposed. Key-phrases are extracted by computing the grey relational degree between the given key-phrases and the feature words. The main advantage of this method is that it applies equally to large and small samples, whether or not the samples follow a regular pattern. It therefore solves the problem that key-phrase extraction methods based on mathematical statistics ignore the contribution of low-frequency domain-specific words.
5. The proposed algorithms are applied to the Integrated Risk Information Network. Combined with focused crawler technology, the intelligent collection and classification of Integrated Risk Information on the Internet is implemented and achieves good results.
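
The grey relational degree behind the fourth contribution can be sketched in Python as follows. This uses the standard formulation of grey relational analysis (a grey relational coefficient per position, averaged into a degree, with distinguishing coefficient ρ conventionally set to 0.5). The abstract does not state which variant the thesis uses or how key-phrases and feature words are turned into numeric sequences, so the function name `grey_relational_degrees` and the input representation (equal-length sequences already normalized to a comparable scale) are assumptions made for illustration.

```python
def grey_relational_degrees(reference, candidates, rho=0.5):
    """Return one grey relational degree per candidate sequence.

    reference  -- reference sequence x0 (assumed here: a given key-phrase)
    candidates -- comparison sequences x_i of the same length (feature words)
    rho        -- distinguishing coefficient, conventionally 0.5
    """
    # Absolute differences |x0(k) - x_i(k)| for every candidate and position.
    deltas = [[abs(r - c) for r, c in zip(reference, seq)] for seq in candidates]
    flat = [d for row in deltas for d in row]
    d_min, d_max = min(flat), max(flat)
    if d_max == 0:  # every candidate is identical to the reference
        return [1.0] * len(candidates)

    degrees = []
    for row in deltas:
        # Grey relational coefficient at each position k.
        coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in row]
        # The degree is the mean coefficient over all positions.
        degrees.append(sum(coeffs) / len(coeffs))
    return degrees
```

Feature words would then be ranked by their degree with respect to the given key-phrase. Because the coefficients depend only on differences between sequences, the computation does not need a large sample or an assumed distribution, which matches the small-sample advantage claimed in the abstract.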

  • 【Online publication contributor】 Northwest University
  • 【Online publication issue】 2011, Issue 08