
Applied Research on Text Classification Based on a Concept Space

A Study on Concept-VSM And Its Application in Text Classification

【Author】 Huang Haiying

【Advisor】 Lin Shimin

【Author Information】 Guangxi Normal University, Computer Software and Theory, 2002, Master's degree

【Abstract】 With the rapid growth of textual information, and especially of online information on the Internet, text (web page) classification is becoming increasingly important. Text classification helps users selectively read and process massive volumes of text, can to a large extent alleviate the current disorder of online information, and makes it easier for users to locate and route the information they need. Automatic text classification has therefore become a key technique of considerable practical value and a powerful means of organizing and managing data.

Text classification methods fall into two categories: knowledge-based methods and statistics-based methods. A knowledge-based text classification system targets a specific domain and requires a knowledge base of that domain as support; because of the many difficulties in knowledge extraction, updating, maintenance, and self-learning, its range of application is narrow. Statistics-based methods, by contrast, rely on purely mathematical computation, do not demand complex linguistic or domain knowledge, and have shown good results in practice, which has made them the prevailing approach to text classification. The widely used statistical models include the vector space model, the Naive Bayes model, the example-mapping model, and the support vector machine model.

The Vector Space Model (VSM), proposed by G. Salton et al. in the 1960s, reduces a document to a vector whose components are term weights and reduces the classification process to operations on vectors, greatly lowering the complexity of the problem. Moreover, the VSM prescribes no particular term-weighting scheme or similarity measure; it only provides a theoretical framework, so different weighting functions and similarity computations can be plugged in, which gives the model wide adaptability. However, the model generally represents documents by index terms, and classification is achieved through character- and word-level matching between documents. This is shallow lexical matching rather than deep semantic matching, and it is inaccurate: synonymy and polysemy of characters and words adversely affect the recall and precision of text classification, respectively.

LSI (Latent Semantic Indexing) is an algebraic model for information retrieval proposed by S. T. Dumais et al. in 1988. Its basic idea is that the words in a text are related to one another, that is, a latent semantic structure exists; statistical methods are therefore used to discover this structure and to represent words and texts in terms of it, with the result that correlations among words are eliminated and text vectors are simplified. LSI performs retrieval using concept indices derived by statistical computation rather than traditional index characters and words. It rests on the assertion that a document collection contains an implicit semantic structure of word usage, one that is partly obscured by the semantic and formal variety of the words in the documents. LSI computes the singular value decomposition (SVD) of the term-document matrix of the original collection and takes the k largest singular values and their corresponding singular vectors to form a new matrix that approximates the original term-document matrix. Because the new matrix reduces the ambiguity of the semantic relations between words and documents, it is more favorable to information retrieval. Compared with traditional retrieval models, LSI has two advantages: the meaning of each dimension of the vector space changes fundamentally, reflecting not simple word frequency and distribution but strengthened semantic relations; and replacing the original word and document vectors with low-dimensional ones makes large document collections tractable.

Taking the LSI method as its foundation and inspired by papers [1] and [2], this thesis investigates computational methods for text classification based on a concept space. Since text classification is a branch of computerized information retrieval, the thesis first briefly introduces the meaning, history, and trends of information retrieval and computerized information retrieval; the basic theory, objects of study, and methods of computerized information retrieval; and the key techniques of text classification. It then discusses the ideas and theoretical foundations of latent semantic indexing (LSI), illustrates them with figures and a small example, and explains the advantages of the method. The main work of the thesis is to construct a concept space for text classification on the basis of the VSM and LSI and to propose methods for computing, within that space, word similarity, document similarity, and the similarity between a document to be classified and a class. Concepts are acquired from a large training set, documents are converted into document vectors, class base vectors are constructed, and finally each document vector is matched against the class base vectors in the concept space to complete the classification. Classification-learning problems that remain to be explored in the concept space are also discussed. Experiments confirm that concept-space-based text classification achieves good results.

Because synonymy and polysemy are pervasive in language, text classification methods based on word matching are inherently limited. The concept-space-based method proposed in this thesis replaces the original document vector space, built on independent word indices, with a smaller but more robust statistically derived concept space, and shows a clear performance advantage. Through more systematic future study of the computational methods of concept-space-based text classification, we hope to find a classification method that is both theoretically rigorous and practically feasible.
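The SVD step at the heart of the LSI method described above can be sketched as follows. This is a minimal illustration using a hypothetical toy term-document matrix and NumPy's general-purpose SVD routine; it is not the thesis's actual data or implementation.

```python
import numpy as np

# Hypothetical toy term-document matrix A (rows = terms, columns = documents).
# The thesis builds this from a large training corpus; here it is illustrative only.
A = np.array([
    [1, 1, 0, 0],   # term 1
    [1, 0, 0, 0],   # term 2
    [0, 1, 1, 0],   # term 3
    [0, 0, 1, 1],   # term 4
], dtype=float)

# Singular value decomposition: A = U * diag(s) * Vt,
# with singular values in s sorted in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values and their singular vectors.
# The rank-k matrix A_k approximates A; its k dimensions form the "concept space".
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Each document is now represented by a k-dimensional concept vector
# instead of a sparse vector over all index terms.
doc_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T   # shape: (n_docs, k)
print(doc_vectors.shape)  # (4, 2)
```

For realistic corpora the term-document matrix is large and sparse, so a truncated sparse SVD (e.g. `scipy.sparse.linalg.svds`) would be used rather than the dense decomposition shown here.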

【Abstract】 As the volume of information available on the Internet and corporate intranets continues to increase, there is growing interest in helping people better find, filter, and manage these resources. Text classification, the assignment of natural language texts to one or more predefined categories based on their content, is an important component of many information organization and management tasks. Its most widespread application has been assigning subject categories to documents to support text retrieval, routing, and filtering. In many contexts, trained professionals are employed to categorize new items; this process is very time-consuming and costly, which limits its applicability. Rule-based approaches similar to those used in expert systems are common, but they generally require manual construction of the rules, make rigid binary decisions about category membership, and are typically difficult to modify. Another strategy is to use statistical analysis to automatically construct classifiers from labeled training data. The resulting classifiers, by contrast, have many advantages: they are easy to construct and update, they depend only on information that is easy for people to provide, they can be customized to the categories of interest to individuals, and they allow users to smoothly trade off precision against recall depending on the task. A growing number of statistical classification methods have been applied to text categorization, including the Vector Space Model, the Naive Bayes model, and the Support Vector Machine model. The Vector Space Model (VSM) was proposed by G. Salton in the 1960s. In this model, each document is represented as a vector of words, as is typically done in the popular vector representation for information retrieval.
Text classification, however, is essentially semantic categorization, while the VSM represents the contents of documents and queries with a set of index terms, which can lead to poor classification performance. Latent Semantic Indexing (LSI), proposed by S. T. Dumais in 1988, is an algebraic model that has achieved good results in information retrieval. It maps document and query vectors into a lower-dimensional space by singular value decomposition, so that the inherent vagueness of a retrieval process based on keyword sets is considerably reduced and the semantic associations among documents are highlighted. LSI is useful for finding relations between terms where human effort does not yield good results; thus synonymy can be resolved, and polysemy can be partially resolved. Guided by LSI and VSM theory and building on papers [1] and [2], this thesis probes into text classification based on a concept-VSM. First, the thesis gives a brief introduction to information, information retrieval, and computerized information retrieval and their development. It then discusses the types of information retrieval models and the basic theory, objects, and methods of computerized information retrieval. Third, it introduces the fundamental principles of LSI and uses an illustration and a small example to elucidate its advantages. The focus of this work is on building a concept space based on the VSM and LSI; presenting methods for computing word similarity and text similarity in the concept space; acquiring concepts from a large training set; converting texts into text vectors; and constructing class base vectors. Finally, the thesis discusses future work: the classification-learning problem in the concept space.
At the end of the thesis, theoretical analysis and experimental results both show that classification based on the concept-VSM can improve categorization performance significantly, with high average classification precision and recall. Because synonymy and polysemy are pervasive, text classification based on word matching is inherently limited; this thesis therefore presents a text classification method based on a concept-VSM, with a small but more robust concept space in place of the text vector space based on independent index terms.
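The matching step summarized above, in which a document vector is compared against class base vectors in the concept space, can be sketched as follows. This is a hedged illustration: the 2-dimensional concept vectors, the class names, and the use of the class centroid as the base vector are assumptions for the sake of a runnable example; the thesis defines its own similarity computations.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two concept vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical concept vectors for labeled training documents
# (in practice these come from the SVD-derived concept space).
train = {
    "sports":  [np.array([0.9, 0.1]), np.array([0.8, 0.2])],
    "finance": [np.array([0.1, 0.9]), np.array([0.2, 0.7])],
}

# Class base vector: here taken as the centroid of each class's
# training document vectors (one plausible construction).
centroids = {c: np.mean(vs, axis=0) for c, vs in train.items()}

def classify(doc_vec):
    # Assign the document to the class whose base vector it is most similar to.
    return max(centroids, key=lambda c: cosine(doc_vec, centroids[c]))

print(classify(np.array([0.85, 0.15])))  # -> sports
```

A new document would first be folded into the concept space (projected onto the retained singular vectors) before being passed to `classify`.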

  • 【CLC Number】TP391.3
  • 【Cited By】1
  • 【Downloads】249