Research of Chinese-Text Retrieval Based on Latent Semantic Indexing (基于潜在语义索引的中文文本检索研究)
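【Author】 Li Yuanyuan (李媛媛);
【Advisor】 Ma Yongqiang (马永强);
【Author information】 Southwest Jiaotong University, Computer Application Technology, 2008, Master's degree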
【Abstract】 Most information on the Internet is stored as text, and the explosive growth of text poses a major challenge to information retrieval: it is increasingly difficult to find relevant information online quickly and accurately. The most widely used approach, keyword-based string matching, compares only the surface form of words. Because natural language is full of synonymy (many words, one meaning) and polysemy (one word, many meanings), users often cannot express what they actually want to retrieve with a keyword or a string of keywords.

The Latent Semantic Indexing (LSI) model addresses the synonymy and polysemy problems that keyword-based retrieval cannot handle. It is also computationally tractable and requires little human intervention. LSI builds a latent semantic space by truncated singular value decomposition and projects both terms and documents into that space. This makes it possible to extract deeper semantic relationships among terms and to expose the semantic structure of natural language, which improves retrieval performance.

This thesis examines how LSI and its properties can be used to further improve Chinese text retrieval. First, the key techniques and mathematical foundations of LSI are studied in depth, and their application to Chinese text is illustrated and analyzed with examples. Second, term weighting, an important optimization step in LSI, is analyzed in detail; a new weighting scheme based on a "non-linear function" and a "position factor" is proposed, and its effect is verified by comparative experiments. Third, because LSI makes it easy to compute document-to-document similarity, a "doc-doc retrieval" function is proposed; it compensates for the loss of precision caused by short or inaccurately entered queries and helps users retrieve more effectively. Finally, a "Chinese LSI Analysis System" was developed as an experimental platform. Experiments were designed for each relatively independent stage of LSI, the results are presented visually, and all of the work in the thesis was verified on this system.
【Key words】 Information Retrieval; Latent Semantic Indexing; Term Weighting; doc-doc retrieval;
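The abstract describes LSI as a truncated singular value decomposition that projects terms and documents into one latent space, and it proposes retrieval by document-to-document similarity. The following is a minimal Python sketch of that general idea, not the thesis's implementation. The toy term-document matrix, the English terms (standing in for segmented Chinese words), the function names `build_lsi`, `fold_in` and `cosine_to_docs`, and the choice `k = 2` are all assumptions made for illustration. Raw term counts are used; the weighting scheme proposed in the thesis is not reproduced because the abstract does not give its formula.

```python
import numpy as np

def build_lsi(term_doc, k):
    """Truncated SVD of a (terms x docs) matrix; keeps the k largest singular values."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    # One row per document: its coordinates in the k-dimensional latent space.
    doc_vecs = (s_k[:, None] * Vt_k).T
    return U_k, s_k, doc_vecs

def fold_in(term_vec, U_k):
    """Project a term-count vector (a query or a document) into the latent space."""
    return term_vec @ U_k

def cosine_to_docs(vec, doc_vecs):
    """Cosine similarity between one latent vector and every document vector."""
    denom = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(vec)
    denom = np.where(denom == 0, 1.0, denom)  # guard against zero vectors
    return (doc_vecs @ vec) / denom

# Hypothetical toy corpus. Rows are terms, columns are documents d0..d3.
terms = ["car", "automobile", "engine", "apple", "fruit"]
A = np.array([
    [1, 0, 0, 0],   # car
    [0, 1, 0, 0],   # automobile
    [1, 1, 0, 0],   # engine
    [0, 0, 1, 1],   # apple
    [0, 0, 1, 2],   # fruit
], dtype=float)

U_k, s_k, doc_vecs = build_lsi(A, k=2)

# Keyword-style query containing only the term "car".
query = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
query_sims = cosine_to_docs(fold_in(query, U_k), doc_vecs)

# Document-to-document retrieval: use document d0 itself as the query.
doc_sims = cosine_to_docs(doc_vecs[0], doc_vecs)

print("query 'car' vs documents:", np.round(query_sims, 3))
print("d0 vs documents:        ", np.round(doc_sims, 3))
```

In this toy corpus, document d1 contains "automobile" and "engine" but not "car", so exact keyword matching gives it no overlap with the query. In the two-dimensional latent space the query lies in the same direction as both d0 and d1, because "car" and "automobile" co-occur with "engine". This is the synonymy effect the abstract refers to. Using `doc_vecs[0]` as the query shows the same machinery applied to a whole document instead of a short keyword query.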