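【Summary】 With the rapid development of the Internet, a flood of information is rushing toward people, and the question of how to manage the information they obtain is becoming increasingly prominent. Categorizing texts is a document-management method that people commonly use.

This paper proposes a concept-based model for the automatic categorization of natural-language text. The model takes HowNet (《知网》) as its main source of concept knowledge and the concepts denoted by words as the basis for categorization; it decomposes concepts further into sememes and categorizes texts in the vector space spanned by the classifiable sememes. The model can be outlined as follows. The text categorization system consists of a training module and a categorization module, and sememes are divided into classifiable and unclassifiable sememes. After a text has been preprocessed, keywords are extracted according to a set of rules. Ambiguous keywords are disambiguated at the concept level according to their part of speech and context. Each keyword is then decomposed into sememes according to the HowNet definition of the concept it denotes, and the unclassifiable sememes are discarded, so that the text is represented as a vector in the classifiable sememe vector space. Once every text in the training set has been represented as a vector, similar vectors form text clusters in the vector space. A text to be categorized is represented as a vector in the same way, and it is assigned the category of its k nearest neighbors in the training set. Experiments show that, compared with keyword-based text categorization methods, the model achieves better recall and precision, needs less space during categorization, and takes less computing time.

This paper puts forward new ideas in three respects. First, it is the first to divide sememes into classifiable and unclassifiable sememes, and it proposes principles and a method for making this division. The division makes it possible to capture the most important domain characteristics of a concept during text categorization. Second, although the existing literature has proposed representing texts by concepts, those representations are all based on synonyms. Decomposing concepts into sememes better reflects the essence of concepts and the relatedness between them, and representing a text by sememes better reflects its central meaning. Third, it is the first to introduce concept disambiguation into text categorization, and it proposes a new concept disambiguation algorithm.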

【Abstract】 With the rapid growth of the Internet, a flood of information surges toward us, and how to manage all the information we have obtained has become an urgent problem. Text Categorization (TC) is an important method that people commonly use to deal with this problem. This paper proposes a new concept-based model for automatic natural-language text categorization. The model takes HowNet as its main source of knowledge and the concepts of words as the basis of text categorization. The concepts of words are reduced to sememes, and TC is performed in the Classifiable Sememe Vector Space (CSVS). The TC model can be summarized as follows. The TC system is divided into two parts: a training part and a categorization part. Sememes are divided into classifiable sememes and unclassifiable sememes. Keywords are extracted from a text after it has been preprocessed. The keywords are disambiguated according to their parts of speech and context. The concepts of the keywords are then reduced to sememes according to their definitions in HowNet. After all unclassifiable sememes have been removed, the text is represented as a vector in the CSVS. Similar texts form a cluster in the CSVS. A new text is represented as a vector in the same way, and its k nearest neighbors are found among the vectors of the training texts. The most frequent category among those k texts is taken as the category of the new text. Experiments show that the recall and precision of this TC model are better than those of keyword-based TC models. The model also takes less computing time and less working space.

This paper puts forward new ideas in three respects. 1. Sememes are divided into classifiable and unclassifiable sememes, and we propose the principle and method for obtaining the classifiable sememes. In this way, we can capture the most important domain attributes of a concept. 2. Although some papers use concepts to represent a text, those representations are based on synonyms. Reducing a concept to sememes represents the nature of the concept more accurately and the relevance between concepts more naturally. As a result, the main idea of a text is represented more accurately by sememes. 3. Word disambiguation is applied to text categorization for the first time, and a new disambiguation algorithm is put forward in this paper.
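
The pipeline summarized in the abstract (keywords reduced to sememes, unclassifiable sememes dropped, k-nearest-neighbor voting over the resulting vectors) can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration only: the toy `SEMEMES` lexicon stands in for real HowNet definitions, the `UNCLASSIFIABLE` set is hand-picked rather than derived by the paper's principle, cosine similarity is one possible choice of nearness measure, and keyword extraction and disambiguation are assumed to have been done already.

```python
from collections import Counter
import math

# Hypothetical toy lexicon: keyword -> sememes (a stand-in for HowNet definitions).
SEMEMES = {
    "doctor": ["human", "occupation", "medical", "cure"],
    "hospital": ["institution", "place", "medical", "cure"],
    "vaccine": ["medicine", "medical", "cure"],
    "striker": ["human", "occupation", "sport", "compete"],
    "stadium": ["facility", "place", "sport", "compete"],
    "goal": ["result", "sport", "compete"],
}

# Hypothetical set of unclassifiable sememes: too general to indicate a domain.
UNCLASSIFIABLE = {"human", "occupation", "institution", "place", "facility", "result"}


def to_vector(keywords):
    """Represent a text (given as keywords) as sparse counts of classifiable sememes."""
    vec = Counter()
    for word in keywords:
        for sememe in SEMEMES.get(word, []):
            if sememe not in UNCLASSIFIABLE:
                vec[sememe] += 1
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse vectors (assumed nearness measure)."""
    dot = sum(a[s] * b[s] for s in a if s in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(keywords, training, k=3):
    """Return the most frequent category among the k most similar training vectors."""
    query = to_vector(keywords)
    ranked = sorted(training, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Tiny illustrative training set of (vector, category) pairs.
TRAINING = [
    (to_vector(["doctor", "hospital"]), "health"),
    (to_vector(["vaccine", "hospital"]), "health"),
    (to_vector(["striker", "goal"]), "sports"),
    (to_vector(["stadium", "goal"]), "sports"),
]

print(classify(["doctor", "vaccine"], TRAINING))
print(classify(["striker", "stadium"], TRAINING))
```

Because general sememes such as `human` and `place` are removed before vectorization, "doctor" and "striker" no longer look similar merely because both denote people, which is the effect the abstract attributes to keeping only classifiable sememes.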

