

Research of Chinese Information Retrieval System and Document Reranking

【Author】 Fang Fang (方芳)

【Advisors】 Chen Jianxun (陈建勋); Liu Maofu (刘茂福)

【Author Information】 Wuhan University of Science and Technology, Computer Application Technology, 2010, Master's

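【Abstract (translated from the Chinese)】 With improvements in computer system performance, the rapid growth of information on the Internet, and the rapid rise in enterprise informatization, Chinese-language information resources are increasing at great speed. While this growth satisfies people's information needs, it also makes it harder to find the required information quickly and accurately. Under these circumstances, information retrieval technology has become a research hotspot. Information retrieval (IR) usually refers to text information retrieval and covers the storage, organization, representation, querying, and access of information; its core is the indexing and retrieval of text. The main IR techniques include indexing, query expansion, retrieval models, and reranking; Chinese information retrieval additionally involves word segmentation. The work in this thesis on Chinese IR techniques falls into two parts. First, using the NTCIR7 Chinese IR4QA subtask as the experimental setting, a Chinese information retrieval system was designed and implemented. At indexing time the system segments the raw text into words and builds an inverted index with words as units; the retrieval component uses the classical vector space model. To address the word mismatch problem, a query expansion method based on local co-occurrence is applied after the initial results are retrieved. Experimental results show that system performance improves markedly after query expansion. Evaluation of the system's results with the official NTCIR7 evaluation tool shows that our retrieval system achieves good retrieval performance. In addition, document reranking was studied for specific question types. Because users often browse only the top N results returned by a retrieval system, this thesis combines the characteristics of the open resource Wikipedia with those of two question types, definition and biography, and introduces the Wikipedia pages related to a specific question to rerank the initially retrieved documents. Experiments show that this method effectively improves the precision of the top-ranked documents.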

【Abstract】 With improvements in computer system performance, the rapid growth of information on the Internet, and the rising degree of enterprise informatization, Chinese information resources are increasing rapidly. This growth meets people's information needs but also makes it harder to find the required information quickly and accurately. As a result, information retrieval technology has become a research hotspot. Information retrieval (IR) usually refers to text information retrieval, which covers the storage, organization, representation, querying, and access of information; its core is text indexing and retrieval. The main techniques in an information retrieval system include indexing, query expansion, retrieval models, and document reranking. Chinese information retrieval also involves word segmentation. The work on Chinese information retrieval in this thesis is divided into two parts. First, taking the NTCIR7 Chinese IR4QA subtask as the experimental background, we design and implement a Chinese information retrieval system. The indexing component segments the original documents into words and then generates an inverted index with words as units. The retrieval component applies the classical vector space model. To address the word mismatch problem, after the initial search results are obtained, a query expansion method based on local co-occurrence is used to obtain additional useful keywords and generate a new query. The experimental results show that this query expansion strategy clearly improves system performance. Evaluation with the official NTCIR7 tool also shows that our system achieves relatively good retrieval performance. Second, we study document reranking for specific types of questions. When a retrieval system returns results, users often browse only the top N documents. In view of this, we try to improve the precision of the top results through document reranking. This thesis combines the characteristics of the open resource Wikipedia with those of definition and biography questions, and uses the Wikipedia pages related to a specific question to rerank the initial retrieval results. Experiments show that this method effectively improves the precision of the top-ranked results.
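
As a rough illustration of the first part of the pipeline summarized above (word-level inverted index, vector space retrieval, then query expansion from the initial results), the following Python sketch may help. It is not the thesis implementation: the TF-IDF weighting, the co-occurrence counting rule, the function names (`build_index`, `search`, `expand_query`), the parameters `top_k` and `n_new`, and the toy documents are all assumptions made for illustration, and the documents are assumed to be already segmented into words.

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """Build an inverted index from {doc_id: [segmented words]}."""
    index = defaultdict(dict)  # term -> {doc_id: term frequency}
    for doc_id, words in docs.items():
        for term, tf in Counter(words).items():
            index[term][doc_id] = tf
    return index

def search(index, n_docs, query_terms):
    """Rank documents by cosine similarity of TF-IDF vectors (assumed weighting)."""
    idf = {t: math.log(n_docs / len(p)) + 1.0 for t, p in index.items()}
    norms = defaultdict(float)  # squared length of each document vector
    for t, postings in index.items():
        for d, tf in postings.items():
            norms[d] += (tf * idf[t]) ** 2
    q_tf = Counter(t for t in query_terms if t in index)
    q_norm = math.sqrt(sum((qtf * idf[t]) ** 2 for t, qtf in q_tf.items()))
    scores = defaultdict(float)
    for t, qtf in q_tf.items():
        for d, tf in index[t].items():
            scores[d] += (qtf * idf[t]) * (tf * idf[t])
    ranked = [(d, s / (math.sqrt(norms[d]) * q_norm)) for d, s in scores.items()]
    return sorted(ranked, key=lambda x: (-x[1], x[0]))

def expand_query(docs, ranked, query_terms, top_k=2, n_new=2):
    """Add the terms that most often co-occur with query terms in the top_k results.

    This simple document-level count is an assumed stand-in for the
    local co-occurrence method mentioned in the abstract.
    """
    query = set(query_terms)
    counts = Counter()
    for doc_id, _ in ranked[:top_k]:
        words = set(docs[doc_id])
        if words & query:
            counts.update(words - query)
    best = sorted(counts.items(), key=lambda x: (-x[1], x[0]))[:n_new]
    return list(query_terms) + [t for t, _ in best]

# Hypothetical, pre-segmented toy documents.
docs = {
    "d1": ["信息", "检索", "系统"],
    "d2": ["信息", "检索", "模型"],
    "d3": ["天气", "预报"],
}
index = build_index(docs)
query = ["信息", "检索"]
ranked = search(index, len(docs), query)
expanded = expand_query(docs, ranked, query)
second_pass = search(index, len(docs), expanded)
```

Here the initial search returns only the documents that share a word with the query, and the expanded query adds words drawn from those top results before a second retrieval pass. A real system would replace the toy word lists with the output of a Chinese word segmenter.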
