
Research and Implementation of a Tibetan Full-Text Retrieval System Based on LUCENE (基于LUCENE的藏文全文检索系统研究与实现)

【Author】 巴桑杰布

【Advisor】 欧珠

【Author information】 Tibet University, Chinese Minority Languages and Literature, 2012, Master's thesis

【Abstract】 In recent years, a series of national special projects has brought substantial progress in research and development on Tibetan information processing. Breakthroughs have been made in areas ranging from the unification of standards to the development of key basic Tibetan software, and these establish the preconditions for further research and development. Nevertheless, Tibetan information processing technology is still in its infancy, and application systems such as a Tibetan full-text retrieval system are conspicuously lacking. Because full-text retrieval is an indispensable tool for obtaining information in the information society, this thesis sets out to study and implement such a system for Tibetan. The research draws on traditional knowledge of the grammar of characters, words, sentences, paragraphs, and articles, and on knowledge from the field of information processing, including information retrieval principles, word segmentation techniques, query methods, and document relevance ranking algorithms. It must also address the problems of Internet information: heavy redundancy, uneven quality, a multitude of formats, scattered locations, complex associations, and the difficulty users have in expressing their needs. LUCENE is an open-source full-text retrieval toolkit; extending its functions within the conventions of its framework to provide full-text retrieval for the target system offers a shortcut to solving these problems. Building on a study of full-text retrieval theory and of LUCENE-based full-text retrieval systems, this thesis obtains the following results. First, it designs and implements a LUCENE-based Tibetan tokenizer that supports bigram segmentation of three languages: Tibetan, Chinese, and English. Second, drawing on a characteristic of Tibetan sentences, namely that the main constituents of a sentence are linked by case particles to express semantic relations, it proposes an optimization strategy for this tokenizer, together with a method for splitting off the contracted forms of case particles and a method for restoring the Tibetan syllable after a contracted form has been split off, in order to improve segmentation accuracy. Third, using this tokenizer, it designs and implements a LUCENE-based Tibetan full-text retrieval system that supports full-text retrieval in Tibetan, Chinese, and English.
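The bigram segmentation mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation, which is a LUCENE analyzer and is not described in this record. The sketch rests on several assumptions: the Tibetan unit is the tsheg-delimited syllable, the Chinese unit is a single Han character, English words are kept whole and lowercased rather than bigrammed, and bigrams never cross a shad (clause mark). The names `RUN`, `bigrams`, and `tokenize` are hypothetical.

```python
import re

TSHEG = "\u0f0b"  # Tibetan inter-syllable mark (assumed unit delimiter)

# One alternative per script. A Tibetan run may contain tsheg marks but not
# shad (U+0F0D), so bigrams never span a clause boundary.
RUN = re.compile(
    r"(?P<bo>[\u0f0b\u0f0c\u0f40-\u0fbc]+)"  # Tibetan letters, signs, tsheg
    r"|(?P<zh>[\u4e00-\u9fff]+)"             # CJK unified ideographs
    r"|(?P<en>[A-Za-z0-9]+)"                 # ASCII words and numbers
)


def bigrams(units, sep=""):
    """Return overlapping pairs of adjacent units; a lone unit is kept as is."""
    if len(units) == 1:
        return list(units)
    return [a + sep + b for a, b in zip(units, units[1:])]


def tokenize(text):
    """Split mixed Tibetan/Chinese/English text into index terms."""
    tokens = []
    for match in RUN.finditer(text):
        run = match.group()
        if match.lastgroup == "bo":
            # Syllables are the pieces between tsheg marks.
            syllables = [s for s in re.split("[\u0f0b\u0f0c]+", run) if s]
            tokens.extend(bigrams(syllables, TSHEG))
        elif match.lastgroup == "zh":
            tokens.extend(bigrams(list(run)))
        else:
            tokens.append(run.lower())  # assumption: English is not bigrammed
    return tokens
```

Under these assumptions, `tokenize("全文检索")` yields the overlapping pairs 全文, 文检, 检索, and a Tibetan string of three syllables yields two syllable pairs joined by a tsheg. Splitting contracted case particles, as the thesis proposes, would be an additional step applied to each syllable before the pairs are formed; it is not attempted here.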

  • 【Online publication contributor】 Tibet University
  • 【Online publication issue】 2012, No. 09