
Research on Chinese Web Information Retrieval Based on Statistical Language Models

【Author】 Li Zhen (李贞)

【Advisor】 Li Jinhua (李进华)

【Author information】 Central China Normal University, Information Science, 2012, Master's thesis

【Abstract】 With the rapid development of the Internet, the volume of information has grown exponentially and the channels for obtaining it have become more diverse, yet searching for information has become more complicated. People urgently need advanced information processing technology to handle massive amounts of information and quickly retrieve what they need, so that they can make better decisions and conduct better research. The spread and wide application of information processing technology owes much to the development of natural language processing. To address information retrieval problems effectively, research on document content representation, retrieval models, matching strategies and ranking algorithms has steadily increased. Retrieval models remain a hot topic in information retrieval research, and a variety of models and methods have emerged, such as the Boolean model, the vector space model and the probabilistic model. In particular, the statistical language model proposed in recent years combines natural language with statistics; resting on a strong mathematical foundation, it has become the dominant retrieval model in information retrieval and has produced a large body of research results. However, there has been relatively little research on massive Chinese web data, or on combining a Chinese word segmentation component with Lemur to build an information retrieval system suited to Chinese. Based on the large-scale Chinese web corpus CWT200G, and following the standard information retrieval procedures of TREC and SEWM, this thesis takes Lemur as the baseline platform and combines it with ICTCLAS, the Chinese lexical analysis system and word segmentation component from the Chinese Academy of Sciences, to form a simple information retrieval system for experiments. First, the thesis sets out its theoretical basis and introduces the key issues in studying Chinese web information retrieval models based on statistical language methods: statistical language models, data smoothing, Chinese word segmentation and Chinese text indexing. Then it briefly introduces the Chinese web corpus used for retrieval evaluation and the platforms and systems required for the experiments, and analyzes in detail how the data are processed. Finally, it uses experimental data to compare the performance of traditional retrieval models, such as the vector space model and the probabilistic model, with that of the statistical language model in topic retrieval on the Chinese web corpus. In the topic retrieval experiments with the statistical language model, it also tests the Simplified Jelinek-Mercer, Dirichlet Prior and Absolute Discounting smoothing methods and compares the performance of the three in information retrieval.
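As a rough illustration of the three smoothing methods named in the abstract, the sketch below scores a document by query log-likelihood under each one. It uses the textbook formulas, not the thesis's code or Lemur's implementation; the standard Jelinek-Mercer interpolation stands in for the "Simplified" variant, the parameter values (`lam`, `mu`, `delta`) are arbitrary placeholders, and all function names are hypothetical. Documents are assumed to be already segmented into tokens (the step ICTCLAS performs in the thesis).

```python
import math
from collections import Counter


def collection_model(docs):
    """Maximum-likelihood unigram model p(w|C) over all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


# Each function returns the smoothed p(w|d) given the term count in the
# document, the document length, the number of distinct terms in the
# document, and the collection probability p(w|C).
# Parameter defaults are illustrative assumptions, not tuned values.

def jelinek_mercer(count, doc_len, unique_terms, p_coll, lam=0.5):
    # Linear interpolation between document and collection models.
    return (1 - lam) * count / doc_len + lam * p_coll


def dirichlet_prior(count, doc_len, unique_terms, p_coll, mu=1000.0):
    # Collection model acts as a prior worth mu pseudo-counts.
    return (count + mu * p_coll) / (doc_len + mu)


def absolute_discounting(count, doc_len, unique_terms, p_coll, delta=0.7):
    # Subtract delta from every seen term's count and give the freed
    # mass (delta * unique_terms / doc_len) to the collection model.
    discounted = max(count - delta, 0.0) / doc_len
    return discounted + delta * unique_terms / doc_len * p_coll


def query_log_likelihood(query, doc, p_collection, smooth):
    """Sum of log p(w|d) over query terms; terms absent from the
    whole collection are skipped because they cannot be smoothed."""
    tf = Counter(doc)
    score = 0.0
    for w in query:
        p_coll = p_collection.get(w)
        if p_coll is None:
            continue
        score += math.log(smooth(tf[w], len(doc), len(tf), p_coll))
    return score
```

Ranking a collection for a query then amounts to computing `query_log_likelihood` for every document with the chosen smoothing function and sorting by score in descending order.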

  • 【Classification number】 G354
  • 【Times cited】 2
  • 【Downloads】 260