

The Description of Text’s Feature Based on Semanteme Concept

【Author】 Yu Gang (余刚)

【Advisor】 Zhu Zhengyu (朱征宇)

【Author information】 Chongqing University, Computer Application Technology, 2005, Master's thesis

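【Abstract (Chinese original, translated)】 Text feature description is a foundational task in natural language processing, text classification, clustering, Chinese information retrieval, personalized services, and related research. It studies which methods and models should be used to represent the main idea of a document. Such a description must summarize the main content of the document well while remaining convenient for a computer to process. At present, the vector-based approach, the vector space model (VSM), is widely used; it represents a document with a number of feature terms and their weights. In this model, two main factors affect the accuracy of the description: the selection of feature terms and the way their weights are computed. Most research has concentrated on these two aspects, aiming to capture the document's main idea and reflect its implicit information. Selecting feature terms and computing weights with statistics and information theory has, to some extent, addressed the accuracy of VSM descriptions, but few such methods touch on or reveal the semantic information of the feature terms. This thesis addresses how VSM can carry the semantic information of feature terms in two ways. (1) Considering the significant influence that the linguistic context of a word has on its actual meaning, we improve the widely used TF-IDF weighting scheme and adopt a weighting scheme based on word co-occurrence frequency to represent the weights of a text. This scheme retains the statistical information of the TF-IDF formula and also reflects the influence of the specific context on word meaning. (2) For comparing text similarity, we completely abandon purely mathematical formulas for vector similarity (e.g., the Euclidean distance between vectors, the cosine of the angle between vectors, the Bayes algorithm, the K-nearest-neighbor algorithm). Instead, we first compute the semantic similarity between the feature words of the vectors, then compute the maximum-weight matching of the two vectors, and finally sum the similarities of the matched pairs, taking each feature word's weight into account in the sum. The advantage of this computation is that the semantic information of the feature words is considered, and no disambiguation or normalization is needed when obtaining the vector description of a text. Finally, we built a text classifier to compare our work on these two aspects with other approaches; experiments verify that the proposed algorithms improve classification precision and recall to some extent. Although our research mainly targets personalized services, it applies equally to Chinese information retrieval and natural language processing and can be extended to other fields involving language processing.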

【Abstract】 Describing the features of a text is fundamental work for NLP, document categorization and clustering, Chinese information retrieval, personalized services, and so on. It concerns the methods and models used to represent a document's topic. On the one hand, the feature description should summarize the content of the document; on the other, the model should be easy for a computer to process. Currently, the vector space model (VSM) is widely used. VSM uses several feature words and their weights to represent a document. In this model, two factors affect the precision of the description: the choice of feature words and the method of computing their weights. Most researchers focus on these two points, hoping to summarize the documents' topics and reflect their implicit information. Choosing feature words and computing their weights with statistics and information theory has improved, to some extent, the precision with which VSM describes a document, but few methods reflect the semantics of the feature terms. This thesis discusses how to make VSM reflect the semantic information of its terms from the following two aspects. (I) Considering that context has a great impact on a word's actual meaning, we improve on the TF-IDF method, which is the most widely used way to compute term weights. Our method is based on word co-occurrence frequency; it contains the statistical information of TF-IDF and also reflects the impact of the specific context on word semantics. (II) For comparing the similarity of texts, we abandon purely mathematical methods (e.g., the Euclidean distance, the cosine of the angle between vectors, the Bayes algorithm, the K-nearest-neighbor algorithm). Instead, we first compute the semantic similarity between the terms of the two vectors, then compute the maximum-weight matching of the two vectors, and finally sum the similarities of the matched pairs, taking the terms' weights into account. The advantage of our method is that it considers the semantics of the terms and avoids disambiguation and normalization when building the vector description. Finally, we construct a text classifier to compare our method with others. Experiments show that our method improves precision and recall to some extent. Although our research is aimed at personalized services, it can also be applied to Chinese information retrieval and NLP.
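
The similarity computation outlined in point (II) can be illustrated with a minimal Python sketch. The abstract does not give the thesis's actual formulas, so everything below is an assumption made for illustration: the toy similarity table in `toy_term_similarity` stands in for a real semantic similarity measure, the pair score (product of the two term weights and the term similarity) is one plausible way to "take the weights into account", the normalization by the vector norms is our own choice, and the maximum-weight matching is found by brute force, which is only practical for very short vectors.

```python
from itertools import permutations
from math import sqrt


def toy_term_similarity(a, b):
    """Hypothetical stand-in for a semantic similarity measure in [0, 1]."""
    if a == b:
        return 1.0
    # Illustrative values only; a real system would use a semantic resource.
    related = {
        frozenset(("car", "automobile")): 0.9,
        frozenset(("engine", "motor")): 0.8,
    }
    return related.get(frozenset((a, b)), 0.0)


def matching_similarity(doc_a, doc_b, term_sim=toy_term_similarity):
    """Similarity of two documents given as {term: weight} dictionaries.

    Steps: score every term pair, find the one-to-one matching with the
    largest total score, then normalize by the two vector norms (an
    assumed normalization, not taken from the thesis).
    """
    # Put the shorter vector first so every one of its terms gets a partner.
    if len(doc_a) > len(doc_b):
        doc_a, doc_b = doc_b, doc_a
    terms_a, terms_b = list(doc_a), list(doc_b)
    norm_a = sqrt(sum(w * w for w in doc_a.values()))
    norm_b = sqrt(sum(w * w for w in doc_b.values()))
    if not terms_a or norm_a == 0.0 or norm_b == 0.0:
        return 0.0

    def pair_score(i, j):
        ta, tb = terms_a[i], terms_b[j]
        return doc_a[ta] * doc_b[tb] * term_sim(ta, tb)

    # Brute-force maximum-weight matching: try every injective assignment
    # of the shorter vector's terms to the longer vector's terms.
    best = 0.0
    for perm in permutations(range(len(terms_b)), len(terms_a)):
        total = sum(pair_score(i, j) for i, j in enumerate(perm))
        best = max(best, total)
    return best / (norm_a * norm_b)


doc1 = {"car": 0.8, "engine": 0.6}
doc2 = {"automobile": 0.8, "motor": 0.6}
print(matching_similarity(doc1, doc2))
```

In this toy example the two documents share no literal terms, so a cosine over exact term matches would be zero, while the matching-based score is positive because "car"/"automobile" and "engine"/"motor" are paired. For longer vectors, a polynomial-time assignment algorithm such as the Hungarian method would replace the brute-force search.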

