

The Application of Cross-Language Information Retrieval Based on Latent Semantic Analysis

【Author】 Bi Jianting (闭剑婷)

【Supervisor】 Su Yidan (苏一丹)

【Author information】 Guangxi University, Computer Software and Theory, 2008, Master's degree

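【Abstract (translated from the Chinese)】 With the growth of the Internet, people increasingly face the problem of how to find relevant foreign-language documents effectively. In the Internet's early days, online content was mainly in English and most users came from developed countries such as the United States and the United Kingdom. Since then, the number of websites and users from other countries has grown steadily, which poses new problems for traditional information retrieval technology that treats English as the only language. It has therefore become necessary to study retrieval conducted directly in the user's native language, and bilingual or multilingual cross-language information retrieval has become a hot topic. Cross-language information retrieval studies methods for searching documents in any language with a query formulated in one natural language. Because monolingual information retrieval is relatively mature and already in practical use, the basic framework of current cross-language retrieval techniques is inherited and developed from monolingual retrieval. However, different languages carry very different cultural backgrounds and customs, and machine translation quality still falls short of users' needs, so monolingual retrieval methods alone cannot solve deeper problems in cross-language retrieval such as semantic matching. This thesis first introduces the research content of cross-language information retrieval, the related techniques, and the international evaluation standards, and then analyzes the principles, modeling methods, and applications of latent semantic analysis. Next, exploiting properties of latent semantic analysis such as its language independence, the thesis uses it to analyze bilingual text and build a word translation model, and introduces the idea of bi-directional translation to improve translation accuracy. Then, to address the shortcomings of query expansion methods in traditional cross-language retrieval, it combines k-means clustering with the strengths of the latent semantic analysis model in representing texts and words, and proposes a new expansion method that reduces the effect of translation errors and translation ambiguity on query results. Finally, it updates the traditional query term weighting formula, which improves the average precision of retrieval.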

【Abstract】 With the development of the Internet, more and more people face the problem of retrieving foreign-language information effectively. In the early days of the Internet, most web pages were in English and most users came from developed countries such as the United States and the United Kingdom. The subsequent, gradual increase in websites and users from non-English-speaking countries has brought new problems for traditional English-only information retrieval systems. It has therefore become necessary to study how users can retrieve foreign-language information in their native language, and cross-language information retrieval has become a hot topic. The goal of cross-language information retrieval is to search documents in any language with a query written in one natural language. Because monolingual information retrieval is mature and already in practical use, most cross-language retrieval frameworks are derived from monolingual techniques. However, machine translation remains unsatisfactory because of the cultural differences between languages, so monolingual methods alone cannot meet the requirements of cross-language retrieval at the semantic level. This thesis first introduces the main techniques of cross-language information retrieval and the relevant international evaluation standards, and then describes the principles, modeling, and applications of latent semantic analysis. We then propose a word translation model based on latent semantic analysis combined with the idea of bi-directional translation; the experimental results show that its precision is better than that of the traditional vector space model. Next, to address the shortcomings of query expansion in traditional cross-language information retrieval, we propose a new query expansion method based on k-means clustering and latent semantic analysis, which reduces the negative influence of translation errors and translation ambiguity. Finally, we update the weighting formula for the terms of the new query; the results show an improvement in average precision.
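
The abstract does not give the details of the translation model, so the following is only a minimal sketch of one common way to realize the idea, not the thesis's actual implementation. It assumes a toy parallel corpus (made-up data), merges each English text with its Chinese counterpart into one column of a term-by-document matrix, reduces the matrix with a truncated SVD, and translates a word by picking its nearest neighbor in the other language. The bi-directional check is interpreted here, as an assumption, to mean that a translation is accepted only if translating it back returns the original word. The names `PARALLEL_DOCS`, `build_term_vectors`, `nearest`, and `translate` are hypothetical.

```python
import numpy as np

# Toy parallel corpus (made-up data): each pair is an English text
# and its Chinese counterpart, already split into words by spaces.
PARALLEL_DOCS = [
    ("cat animal", "猫 动物"),
    ("dog animal", "狗 动物"),
    ("cat milk", "猫 牛奶"),
    ("dog bone", "狗 骨头"),
]


def build_term_vectors(docs, k):
    """Return (en_terms, zh_terms, vectors); vectors maps term -> k-dim LSA vector."""
    en_terms = sorted({w for en, _ in docs for w in en.split()})
    zh_terms = sorted({w for _, zh in docs for w in zh.split()})
    terms = en_terms + zh_terms
    index = {t: i for i, t in enumerate(terms)}

    # Term-by-document count matrix; each column merges both sides of one pair.
    A = np.zeros((len(terms), len(docs)))
    for j, (en, zh) in enumerate(docs):
        for w in en.split() + zh.split():
            A[index[w], j] += 1.0

    # Truncated SVD: keep the k largest singular values (the latent space).
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    term_vecs = U[:, :k] * s[:k]
    return en_terms, zh_terms, {t: term_vecs[index[t]] for t in terms}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def nearest(term, candidates, vectors):
    """Candidate whose latent vector is most similar to that of `term`."""
    return max(candidates, key=lambda c: cosine(vectors[term], vectors[c]))


def translate(term, source_terms, target_terms, vectors):
    """Best target term, or None if translating it back gives a different word."""
    forward = nearest(term, target_terms, vectors)
    backward = nearest(forward, source_terms, vectors)
    return forward if backward == term else None


en_terms, zh_terms, vectors = build_term_vectors(PARALLEL_DOCS, k=2)
translations = {t: translate(t, en_terms, zh_terms, vectors) for t in en_terms}
print(translations)
```

In this toy corpus every word occurs in exactly the same documents as its translation, so the pairs coincide in the latent space. On real data the similarities are noisier, which is where a back-translation check of this kind could filter out doubtful candidates.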

  • 【Online publication contributor】 Guangxi University
  • 【Online publication year and issue】 2009, Issue 01