

Research of Chinese Full Text Retrieval Technology

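【Author】 于波 (Yu Bo)

【Supervisor】 何婷婷 (He Tingting)

【Author information】 华中师范大学 (Central China Normal University), Computer Application Technology, 2003, Master's degree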

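【Abstract (translated from the Chinese)】 Full-text retrieval is an important technology across all areas of information processing. This thesis studies full-text retrieval in the following respects:
1. It reviews the development of retrieval technology in China and abroad, and discusses the technical characteristics of plain-text retrieval, concept-based information retrieval, hypertext information retrieval, multimedia information retrieval, and data mining.
2. It studies the characteristics and implementation of the two main indexing methods for full-text retrieval. The character-table-based method needs no word segmentation and is easy to implement, so it is widely used in practice. To address its drawbacks (a large index, slow matching, and high recall but low precision), the thesis introduces the second method: retrieval based on a word table.
3. It studies automatic Chinese word segmentation, the key technology in Chinese full-text retrieval. Several methods, including mechanical matching (the MM method), the feature-word-library method, the constraint-matrix method, the syntactic-analysis method, and the understanding-based segmentation method, are compared and analyzed in detail, and the characteristics of each are summarized. Because the MM method is simple to implement and underlies the other methods, the thesis treats it in the most depth.
4. Building on the MM method, the thesis explores a way to implement Chinese full-text retrieval with a hybrid model based on characters, words, and phrases. The basic principle is to treat every single character, word, and phrase alike as a lexical unit and to build a binary tree of Chinese lexical units. During segmentation, the algorithm reads the content on the right side of the tree and compares the lengths of the left nodes to obtain the shortest meaningful lexical unit. On this basis, the thesis further discusses an improved MM method that reduces ambiguous segmentation.
5. It designs a search engine for Web pages on a campus network. The engine is divided into a front end and a back end: the back end fetches Web documents, segments them, and builds and updates the index; the front end reads the index and provides retrieval services to users. The system uses a web spider to scan all HTML documents on the campus network and find every page related to the search keywords. Applying the vector space idea then yields the resource centers, which are the retrieval results.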
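To make the MM method named in item 3 concrete, here is a minimal Python sketch. It assumes that MM refers to the common forward maximum matching scheme: at each position, take the longest string found in a word table, and fall back to a single character when nothing matches. The function name `forward_max_match`, the `max_len` parameter, and the toy `LEXICON` are hypothetical and chosen for illustration. The sketch does not reproduce the thesis's binary-tree hybrid model or its improved MM variant.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation against a word table (sketch)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as the fallback token.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy word table, for illustration only.
LEXICON = {"中文", "全文", "检索", "全文检索", "技术"}

if __name__ == "__main__":
    print(forward_max_match("中文全文检索技术", LEXICON))
```

Because the matcher always prefers the longest entry, "全文检索" is kept whole rather than split into "全文" and "检索". This greedy choice is also the source of the ambiguous segmentations that an improved MM method would try to reduce.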

【Abstract】 Full-text retrieval (FTR) is a key technology in information processing. This thesis studies several aspects of it.
1. The thesis summarizes the development of retrieval technology in China and abroad. It covers not only ordinary document retrieval but also concept-based information retrieval, hypertext information retrieval, multimedia information retrieval, and data mining, each of which is introduced briefly. It outlines the characteristics of full-text retrieval, notes its deficiencies, and describes likely future trends.
2. The thesis describes the two indexing methods used in FTR. Retrieval based on a character table requires no word segmentation, is simple to implement, and is widely used. Because this method requires considerable storage space and a large index database, and yields high recall but low precision, the thesis introduces a second retrieval method based on a word table.
3. Chinese word segmentation is the main difficulty in word-based retrieval. Several segmentation methods are compared: the mechanical matching method, the feature-word-library method, the constraint-matrix method, the syntactic-analysis method, and the understanding-based segmentation method. The MM method is easy to implement and is the foundation of the other methods, so it is introduced in the most detail.
4. Building on the MM method, the thesis proposes a hybrid model based on characters, words, and phrases for Chinese FTR. An improved MM method is put forward to reduce ambiguous segmentation.
5. A retrieval system adopting the algorithm searches Web pages on a campus network. The search engine is divided into a front end and a back end: the back end fetches Web documents, segments the text into words, and builds and updates the index; the front end reads the content of the index database and provides query services to users. The system uses a web spider to scan all HTML documents and find the pages relevant to the query. It then applies the idea of the Vector Space Model (VSM) to select the results.
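The vector space step can be sketched as follows. This is only an illustration under stated assumptions: documents and queries are already segmented into terms, term weights are raw term frequencies, and relevance is the cosine of the angle between the query vector and each document vector. The abstract does not specify the weighting scheme or the similarity measure, and the names `cosine` and `rank` are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_terms, docs):
    """Return ids of matching documents, best match first.

    docs maps a document id to its list of terms (assumed pre-segmented).
    Ties are broken by document id so the ordering is deterministic.
    """
    query = Counter(query_terms)
    scored = [(cosine(query, Counter(terms)), doc_id)
              for doc_id, terms in docs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored if score > 0]


if __name__ == "__main__":
    # Hypothetical pre-segmented pages, for illustration only.
    pages = {
        "a": ["全文", "检索", "技术"],
        "b": ["校园", "网络"],
        "c": ["检索", "检索"],
    }
    print(rank(["检索"], pages))
```

Pages that share no term with the query score zero and are dropped. A production system would more likely use TF-IDF weights and an inverted index rather than scanning every document.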

  • 【Classification number】TP391.3
  • 【Citations】11
  • 【Downloads】763

