Research on Key Problems in Text Mining Based on Cloud Method

(Original Chinese title: 云模型在文本挖掘应用中的关键问题研究)

【Author】 Dai Jin (代劲)

【Advisor】 He Zhongshi (何中市)

【Author information】 Chongqing University, Computer Science and Technology, 2011, PhD

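【Summary】 Text mining (TM) takes textual information as its object and is the process of discovering implicit, potentially valuable knowledge in it, such as the structures, models, and patterns of that information. TM touches on many fields, including information retrieval, pattern recognition, and natural language processing. Because text is the primary way information is stored, TM is becoming increasingly important. Traditional data mining methods still dominate current TM research. As that research deepens, however, applying traditional data mining methods to TM faces ever more serious challenges: the high dimensionality and sparsity of text objects, excessive algorithmic complexity, and the need for prior knowledge have seriously hindered the adoption of TM techniques.

These difficulties ultimately stem from the uncertainty of natural language. The uncertainty of natural language, and of written language in particular, originates in the uncertainty of human thinking. It gives people a richer space for interpretation and a deeper cognitive ability, but it is also the source of many of TM's difficulties. If one starts from reducing the complexity of natural language, builds on existing techniques, and finds artificial intelligence methods for handling uncertainty that suit TM, the development of TM should be greatly advanced. The cloud model is an important tool in the study of uncertain knowledge because it converts between qualitative concepts and quantitative data. Relying on this property, the author introduces cloud theory into the study of key problems in TM, in the hope of offering a new line of thought and a new approach for the further development of TM. The main contents of the thesis are as follows.

① Theoretical extension of the cloud model for TM. The thesis studies text knowledge representation, the physical-space conversion of the corresponding model, and the similarity measurement of text concepts, laying the theoretical foundation for introducing the cloud model. This part covers three topics.

1) A text information table based on VSM. The concept of an information table from knowledge representation is introduced into text representation, and on the basis of the VSM model a text system is represented as a text information table.

2) Conversion of the text information table based on the cloud model. The uncertain relations between texts can be expressed as concepts through the cloud model, but only if the values of all attributes lie in the same universe of discourse, that is, the values of a text on different attributes must have the same physical meaning. In an unprocessed text information table the attributes differ in meaning and their values differ widely, so the table must be converted before the cloud model is used for mining. Building on probabilistic and statistical methods, the thesis proposes a new conversion method that maps the text information table from different attribute spaces into a single physical space and reflects the probability distribution of the attribute values.

3) A text similarity measure based on cloud similarity. Cosine similarity is generally used in TM to measure the relatedness of documents, but existing similarity measures are all computed from strict matching between object attributes and give too little weight to the text object as a whole. Considering both the overall nature and the individual characteristics of text objects, the thesis proposes a cloud similarity based on the digital characteristics of cloud vectors. The digital characteristics of a cloud vector describe a text as a whole, so the similarity between texts becomes the similarity between cloud vectors. This similarity improves mining performance, finds what objects have in common, and takes full account of the randomness and fuzziness of attribute values.

② An automatic text feature extraction algorithm based on the cloud model. Feature selection is an effective way to reduce the dimensionality of text features. In existing methods the selection scale is determined through experimental validation, that is, by experience. Considering both the global and the local distribution of text features, the thesis proposes a high-performance automatic feature extraction algorithm. The algorithm uses cloud membership degrees to correct the feature distribution and, without any prior knowledge, uses the magnitude of the cloud membership degree to characterize feature weights and complete the selection, which reflects the probability distribution of the features. Comparative experiments and analysis of the results show that the resulting feature set contains fewer features and gives higher classification accuracy, outperforming several mainstream feature selection methods.

③ A text classification algorithm based on cloud concept jumping up. The cloud model handles the representation of qualitative knowledge and the conversion between qualitative and quantitative knowledge well. On this basis, the concept extraction method of the cloud model is applied to text classification. After the text collection is converted into a text knowledge table based on the VSM model, the qualitative concepts of training documents in the same category are raised to a higher-level concept. The category of a test text is decided by the cloud similarity between the test text and the qualitative concept of each category. Comparisons with different classifiers under different feature extraction methods show that the algorithm adapts well to different features and classifies better than mainstream classifiers.

④ Fast unsupervised text clustering based on the cloud similarity measure. To address the problems of current text clustering algorithms, the thesis proposes a fast unsupervised text clustering algorithm based on the cloud similarity measure. The algorithm builds on the automatic feature extraction algorithm and, on top of the k-Means dynamic clustering algorithm, uses a stepwise approximation strategy to obtain the optimal value of k. The process of obtaining k is itself the automatic clustering process. During this process the cloud model digital characteristics of each text are extracted, and cloud similarity is then used to compute the similarity between texts. The algorithm avoids the high dimensionality and sparsity of text objects while retaining the efficiency of k-Means. The stepwise approximation strategy also removes the need for prior knowledge of the number of clusters, so the clustering results fit the distribution of the texts better.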
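The abstract does not give formulas for the cloud digital characteristics or for the cloud similarity. As a minimal sketch, assume the three standard characteristics of a normal cloud (expectation Ex, entropy En, hyper-entropy He), estimated with the commonly used backward cloud generator without certainty degrees, and assume that similarity is taken as the cosine between two (Ex, En, He) triples. The names `backward_cloud` and `cloud_similarity` are hypothetical, and the definitions used in the thesis may differ.

```python
import math


def backward_cloud(samples):
    """Estimate (Ex, En, He) from numeric samples (assumed estimator,
    not necessarily the one used in the thesis)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    ex = sum(samples) / n
    # Entropy from the first-order absolute central moment.
    en = math.sqrt(math.pi / 2) * sum(abs(x - ex) for x in samples) / n
    # Unbiased sample variance.
    var = sum((x - ex) ** 2 for x in samples) / (n - 1)
    # var - en^2 can be negative on small samples, so clamp at zero.
    he = math.sqrt(max(var - en * en, 0.0))
    return ex, en, he


def cloud_similarity(a, b):
    """Cosine of the angle between two (Ex, En, He) triples (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical converted attribute values of two texts.
text_a = backward_cloud([0.10, 0.20, 0.15, 0.30])
text_b = backward_cloud([0.12, 0.18, 0.16, 0.28])
similarity = cloud_similarity(text_a, text_b)
```

Under this reading each text is reduced to three numbers regardless of how many attributes it has, which is one way the high dimensionality and sparsity mentioned above could be sidestepped.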

【Abstract】 Text mining (TM) is the process of discovering potentially valuable knowledge in text, such as the structure, models, and patterns of the information it contains. TM draws on data mining, pattern recognition, information retrieval, natural language processing, and other fields. Because text is the main medium for storing information, TM is becoming increasingly important.

Traditional data mining methods still dominate current TM research. As the research deepens, however, applying these methods to TM faces increasingly severe challenges. Difficulties such as the high dimensionality and sparsity of text objects, the high complexity of the algorithms, and the requirement for prior knowledge have seriously hampered the development of TM.

In the final analysis, these problems are due to the uncertainty of natural language. The uncertainty of natural language, especially written text, comes in essence from the uncertainty of human thinking. It gives people a richer space for understanding and a deeper cognitive ability, but it also brings a series of problems to TM. If, starting from reducing the complexity of natural language and making full use of existing technologies, a novel artificial intelligence approach to uncertainty can be found for TM, it will greatly facilitate the development of TM.

The cloud model is an important tool in research on uncertain knowledge. Because it converts efficiently between qualitative concepts and quantitative data, the cloud model is introduced here to address the key issues of TM. The main work is as follows.

(1) Extension of cloud model theory for TM. Text knowledge representation, the physical-space conversion of the corresponding model, and similarity measures for text concepts are studied. Three aspects are covered.

1) Text information table based on the VSM model. The information table from knowledge representation systems is introduced into text representation. On this basis, a text system is expressed as a text information table built on the VSM model.

2) Text information table conversion based on the cloud model. When the cloud model is used to handle the uncertain relations between texts, the values of every attribute must lie in the same domain. That is, the values of a text on different attributes must have the same physical meaning. The attributes of an unprocessed text information table, however, have different meanings, and their values differ widely, so the attributes must be converted to a unified physical space. Using probabilistic and statistical methods, a text information table transformation algorithm is proposed. The algorithm converts the attributes of the text information table to a unified physical space and reflects their probability distribution.

3) Text similarity measure based on cloud similarity. Cosine similarity is commonly used to measure the similarity between texts in text mining. Yet existing similarity measures all rely on strict matching between object attributes, and so give too little consideration to the text object as a whole. Combining the overall distribution with the individual characteristics of text objects, a novel cloud similarity is proposed based on the digital characteristics of cloud vectors, which describe a text as a whole. With cloud similarity, the similarity between texts is converted into the similarity between cloud vectors. It improves mining performance, quickly identifies common features, and takes full account of the randomness and fuzziness of the attribute values.

(2) Text feature automatic selection algorithm based on cloud model (named FAS). Feature selection is an effective method for reducing the size of the text feature space, and several effective methods have been developed. To determine the optimal number of features, however, these methods depend mainly on observation or experience. By combining the overall and the local distribution of features across categories, a high-performance algorithm for feature automatic selection (FAS) is proposed. With FAS the feature set is obtained automatically, and cloud model theory is used to correct the distribution of features. Analysis and comparative experiments show that the selected feature set has fewer features and gives better classification performance than the existing methods.

(3) Text classifier based on cloud concept jumping up (named CCJU). Drawing on the efficient conversion between qualitative concepts and quantitative data, the concept extraction method of the cloud model is applied to text classification. After the text collection is converted into a text information table based on the VSM model, the qualitative concepts extracted from the training texts of each category are raised to a higher-level concept. The cloud similarity between the test text and each category is then compared, and the test text is assigned to the most similar category. Comparisons among different text classifiers under different feature selection methods show that CCJU adapts well to different text features and that its classification performance is better than that of the traditional classifiers.

(4) Rapid and unsupervised text clustering based on cloud similarity (named CS-Means). To address the shortcomings of existing text clustering algorithms, a rapid, unsupervised text clustering algorithm based on cloud similarity is proposed. After text preprocessing with the FAS algorithm, it uses a stepwise approximation strategy built on the k-Means clustering algorithm to obtain the optimal value of k (the number of clusters). The process of obtaining k is itself the automatic clustering process. In this process, the digital characteristics of each text cloud vector are extracted first, and the cloud similarity is then used to measure the similarity between texts. The algorithm avoids the difficulties caused by the high dimensionality and sparsity of text objects while retaining the efficiency of k-Means. The stepwise approximation strategy also solves the problem of how to choose the number of clusters, so the clustering results fit the distribution of the texts better.
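The classification step described in (3) can be illustrated with a small sketch. It rests on assumptions that the abstract does not confirm: "jumping up" is approximated by simply pooling the converted attribute values of all training texts in a category, the digital characteristics (Ex, En, He) are estimated with the common backward cloud generator, and cloud similarity is taken as a cosine between triples. All names (`digital_features`, `train_concepts`, `classify`) and the toy data are hypothetical.

```python
import math
from statistics import fmean, variance


def digital_features(values):
    """Return (Ex, En, He) for at least two numbers (assumed estimator)."""
    values = list(values)
    ex = fmean(values)
    en = math.sqrt(math.pi / 2) * fmean([abs(v - ex) for v in values])
    # Clamp at zero because the difference can be negative on small samples.
    he = math.sqrt(max(variance(values, ex) - en * en, 0.0))
    return (ex, en, he)


def cosine(a, b):
    """Cosine between two triples, standing in for the cloud similarity."""
    norm = math.hypot(*a) * math.hypot(*b)
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def train_concepts(labelled_docs):
    """Pool values per category and summarize each pool as one concept.

    Pooling is a simplification of concept jumping up, not the thesis method.
    """
    pooled = {}
    for label, values in labelled_docs:
        pooled.setdefault(label, []).extend(values)
    return {label: digital_features(v) for label, v in pooled.items()}


def classify(values, concepts):
    """Assign a text to the category whose concept is most similar."""
    doc = digital_features(values)
    return max(concepts, key=lambda label: cosine(doc, concepts[label]))


# Toy training set: (category, converted attribute values of one text).
training = [
    ("flat", [0.5, 0.5, 0.5, 0.5]),
    ("flat", [0.5, 0.5, 0.5, 0.5]),
    ("spiky", [0.0, 1.0, 0.0, 1.0]),
    ("spiky", [1.0, 0.0, 1.0, 0.0]),
]
concepts = train_concepts(training)
predicted = classify([0.5, 0.5, 0.5, 0.6], concepts)
```

Because every category is reduced to a single triple, classifying a text costs one similarity computation per category, independent of the size of the training set.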

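  • 【Online publication contributor】 Chongqing University
  • 【Online publication year and issue】 2011, Issue 12
  • 【Classification number】 TP3; TP391.1
  • 【Times cited】 8
  • 【Downloads】 984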