

The Research and Implementation of Distributed Chinese Full-text Search Technology

【Author】 Li Yang (李杨)

【Supervisor】 Jiang Tianfa (蒋天发)

【Author Information】 South-Central University for Nationalities (中南民族大学), Computer Application Technology, 2009, Master's degree

【Abstract】 With the growth of the Internet, search has become one of the main ways of obtaining information online. Through Internet search engines such as GOOGLE and Baidu, people can conveniently find the information they need in the vast expanse of the Internet. GOOGLE, for example, has collected hundreds of millions of web pages, with storage at the terabyte scale, and users retrieve what they need from it by keyword. Search engines of this kind are usually called general search engines: they collect Internet web pages, they serve all Internet users worldwide, and their service is complete once the keyword search results have been returned to the user.
On the other hand, the wave of informatization within enterprises and organizations has generated a large volume of content. Most of this data is stored in unstructured forms such as files, e-mails, and images, scattered throughout the enterprise's computer systems, and traditional structured databases cannot meet the storage, retrieval, and processing requirements of such unstructured information. A specific kind of search engine has emerged for this class of applications: the enterprise search engine. It typically does not stop at keyword search, but also provides post-processing and mining such as classification and clustering.
Full-text search technology is the core of an enterprise search engine. This thesis gives a systematic account of it, examines in depth the techniques and basic principles of full-text search, analyzes in detail the structure of a full-text search system, the organization of the index, the index library structure, and the index creation process, and proposes a method for optimizing index creation. It focuses on and summarizes retrieval techniques, ranking algorithms, and Chinese word segmentation techniques and, to address the shortcomings of dictionary-based segmentation, uses an improved matching algorithm based on a triple-array Trie index tree, fully realizing the principle of "intelligent segmentation". It then discusses distribution strategies for a distributed index and query strategies built on the distributed index data.
Finally, this thesis describes in detail the features and implementation of the system's full-text search engine; the specific functions were implemented as designed, and testing shows that the system runs well.
