词间语义关系的研究及其在文本分类中的应用

Study on Term Semantic Relationship and Its Application in Text Categorization

【Author】 Cui Xiaoyuan

【Advisor】 He Pilian

【Author information】 Tianjin University, Computer Application Technology, 2006, Master's thesis

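【Abstract, translated from the Chinese】 Automatic text categorization is one of the basic tasks in information retrieval. As the amount of information on the Internet grows explosively, it is hard for people to extract the information they need quickly and effectively from large volumes of text. To address this information disorientation, research on text categorization is becoming increasingly important. This thesis designs and implements a modular, extensible automatic text categorization system, and studies and analyzes each important stage of the categorization process in detail. On this basis, we propose combining a term semantic relationship mining model from natural language processing with the text categorization system, with the aim of addressing the unreasonable basic assumption of the current vector space model that terms are mutually independent. We also expect that exploiting the deeper meaning shared among the terms in a text will allow richer document information to be represented in a smaller vector space, and thereby improve test results for text categorization. The semantic relationship mining model draws on syntactic parsing from linguistics and statistical ideas from informatics; by mining a text corpus in depth, it produces a dictionary of network-structured semantic relationships among terms. This dictionary enriches the vector information of a text and makes the vector representation more efficient and concise. We combine this model with a powerful SVM classifier, which significantly improves the results of the categorization system. In experiments, we compare the model with the standard bag-of-words model on the 20NG and Reuters test corpora. The results show that semantic relationship expansion clearly improves the precision and recall of text categorization. Moreover, it effectively reduces the space and time complexity of the computation while preserving categorization quality, making the analysis of very large text corpora feasible. Finally, the author proposes future research directions for the semantic relationship mining model in information retrieval.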

【Abstract】 Text categorization is one of the basic tasks in information retrieval. With the explosive growth of information on the web, people have difficulty finding the information they need in massive collections of text. To address this so-called "information confusion" problem, research on text categorization has become increasingly important. This thesis designs and implements a modular, scalable automated text categorization framework, and surveys each important step in the framework comprehensively. Building on this framework, we propose a method that integrates term semantic relationships into the classic text categorization task. The method addresses the unreasonable assumption in the Vector Space Model that terms are independent of one another. We also show that the deep associations between terms can be used to improve the results of our experiments. Term semantic relationships are obtained by combining sentence parsing from natural language processing with statistical methods from information theory. We represent the resulting deep term relationships as a thesaurus, which makes document vectors more informative and effective. Combined with the classification power of SVM, this method yields high performance in text categorization. We compare this technique with SVM-based categorization using the simple bag-of-words (BOW) representation, and with other term relationship models, on the 20NG and Reuters-21578 datasets. The comparison shows that our method outperforms the other models in most cases. Finally, we outline future research on using term semantic relationships in information retrieval.
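
The abstract does not give the thesis's actual expansion formula, so the following is only a minimal sketch of the general idea: a bag-of-words vector is enriched with weight for terms that a thesaurus marks as related to terms in the document. The thesaurus entries, the relation strengths, the `damping` parameter, and the function names are all hypothetical and chosen for illustration; they are not taken from the thesis.

```python
from collections import Counter

# Hypothetical thesaurus: term -> {related term: relation strength in [0, 1]}.
# These entries and weights are invented for illustration only.
THESAURUS = {
    "car": {"vehicle": 0.8, "engine": 0.5},
    "engine": {"car": 0.5},
}


def bag_of_words(text):
    """Count lowercase, whitespace-separated tokens."""
    return Counter(text.lower().split())


def expand(bow, thesaurus, damping=0.5):
    """Return a vector that adds damped weight to related terms.

    Each term in the document contributes damping * strength * count
    to every term the thesaurus lists as related to it (an assumed rule).
    """
    expanded = {term: float(count) for term, count in bow.items()}
    for term, count in bow.items():
        for related, strength in thesaurus.get(term, {}).items():
            expanded[related] = expanded.get(related, 0.0) + damping * strength * count
    return expanded


doc = bag_of_words("the car engine failed")
vec = expand(doc, THESAURUS)
```

In this sketch "vehicle" receives a nonzero weight although it never occurs in the document, which is one simple way to relax the term-independence assumption. The expanded vectors could then be passed to a classifier such as an SVM.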

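  • 【Online publication contributor】 Tianjin University
  • 【Online publication year and issue】 2007, Issue 01
  • 【Classification number】 TP391.1
  • 【Times cited】 3
  • 【Downloads】 214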