
基于潜在语义分析的文本检索算法研究

Research on Text Retrieval Algorithm Based on Latent Semantic Analysis

【Author】 Zhao Yahui (赵亚慧)

【Advisor】 Cui Rongyi (崔荣一)

【Author information】 Yanbian University, Computer Application Technology, 2009, Master's thesis

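【Summary】 The goal of text information retrieval is to identify and obtain the needed text from large collections of textual information. With the Internet now widespread, text retrieval has become an important means for people to use information resources effectively and to absorb and acquire textual information quickly and comprehensively. Demand for the technique keeps growing, and it matters greatly for study and scientific research. This thesis studies retrieval strategies and algorithms that efficiently and accurately locate, within a text collection, the paragraphs that are semantically similar to a query text. The underlying text representation is the vector space model (VSM), semantics are represented with the latent semantic indexing (LSI) model, and the search algorithm is based on the genetic algorithm (GA). The main work is as follows: (1) The construction of the latent semantic space is analyzed. The term-document matrix is decomposed by singular value decomposition and, according to the distribution of its singular values, replaced by its best approximation in the least-squares sense, which yields the projection matrix of the latent semantic space. Any text vector can be represented in the latent semantic space through this projection matrix, which both removes the correlation between terms and suppresses noise. (2) An effective method is proposed for deciding that a query text is unrelated to a large text. The query text vector is expressed as a latent semantic space component plus a null semantic space component; when the latent semantic space component is smaller than a given threshold, the query can be judged dissimilar to every paragraph of the large text, and the retrieval strategy can skip further detailed matching. (3) A paragraph retrieval algorithm using the genetic algorithm is designed. When the latent semantic space component of the query text is large enough, all paragraphs (sub-documents) in that space are taken as matching candidates and compared with the query's latent semantic space component by cosine similarity. The genetic algorithm locates a near-optimal paragraph efficiently, and because retrieval is carried out in the latent semantic space, the located paragraph is semantically similar to the query text. Experimental results show that, compared with traditional algorithms, the proposed latent-semantics-based retrieval strategy and genetic-algorithm-based retrieval method achieve considerably higher precision, recall, and F-measure, and that the proposed algorithm is also more efficient than traditional text information retrieval methods. The proposed strategy and method can therefore be used for retrieval over large volumes of text.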
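
Items (1) and (2) above can be illustrated with a minimal NumPy sketch. It is not the thesis's implementation: the toy matrix, the function names (`build_projection`, `split_query`, `is_irrelevant`), the rank `k`, and the use of a relative norm for the threshold test are all assumptions made for illustration, since the record does not say whether the threshold is absolute or relative.

```python
import numpy as np

def build_projection(term_doc, k):
    """Return U_k, the top-k left singular vectors of a term-by-document matrix.

    The columns of U_k span the rank-k latent semantic space (assumed setup).
    """
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return U[:, :k]

def split_query(U_k, q):
    """Split a query vector into latent-space and null-space components."""
    coords = U_k.T @ q       # coordinates of q in the latent semantic space
    latent = U_k @ coords    # latent component, expressed in term space
    residual = q - latent    # component outside the latent space
    return coords, latent, residual

def is_irrelevant(U_k, q, threshold):
    """True when the latent component is too small relative to the query.

    Using the ratio ||latent|| / ||q|| is an assumption of this sketch.
    """
    norm_q = np.linalg.norm(q)
    if norm_q == 0.0:
        return True
    _, latent, _ = split_query(U_k, q)
    return np.linalg.norm(latent) / norm_q < threshold

# Hypothetical toy corpus: 4 terms (rows) by 3 documents (columns).
term_doc = np.array([
    [2.0, 1.0, 0.0],
    [1.0, 2.0, 0.0],
    [0.0, 0.0, 4.0],
    [0.0, 0.0, 0.0],   # this term never occurs in the corpus
])
U_k = build_projection(term_doc, k=2)

in_corpus = np.array([1.0, 1.0, 0.0, 0.0])   # uses terms the corpus knows
off_corpus = np.array([0.0, 0.0, 0.0, 1.0])  # uses only the unseen term
print(is_irrelevant(U_k, in_corpus, 0.5), is_irrelevant(U_k, off_corpus, 0.5))
```

A query rejected by `is_irrelevant` would skip paragraph-level matching entirely, which is where the strategy described in item (2) saves work.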

【Abstract】 The goal of text information retrieval is to recognize and obtain the desired textual information from massive text collections. With the spread of the Internet, text retrieval has become an important way to use information resources effectively and to obtain textual information quickly and comprehensively. The technique is of great significance to study and scientific research, and demand for it is increasingly urgent. This dissertation investigates a text retrieval strategy and algorithm that efficiently and accurately locate the paragraphs in a text collection that are semantically similar to a query text. The vector space model (VSM) is taken as the basic model for representing text, latent semantic indexing (LSI) as the basis of semantic representation, and the genetic algorithm (GA) as the basis of the search algorithm. The main work is as follows: (1) Methods for constructing the latent semantic space are analyzed. The term-document matrix is decomposed by singular value decomposition and then approximated in the minimum squared error sense according to the distribution of its singular values, which yields the projection matrix of the latent semantic space. Through this projection matrix, any text vector can be represented in the latent semantic space. On the one hand, this effectively removes the correlation between terms; on the other hand, it suppresses noise. (2) An efficient method is proposed for determining that a query text is unrelated to a large text. The query text vector is decomposed into a latent semantic space component and a null semantic space component. When the latent semantic space component is smaller than a preset threshold, it can be concluded that the query text is not similar to any paragraph of the large text, and further detailed matching can be skipped. (3) A paragraph retrieval algorithm based on GA is designed. When the latent semantic space component of the query text is large enough, all paragraphs (sub-documents) in the space are taken as candidates and matched against that component by cosine similarity. GA locates approximately optimal paragraphs efficiently. Moreover, because retrieval is performed in the latent semantic space, the located paragraphs are semantically similar to the query text. Experimental results show that, compared with traditional methods, the proposed retrieval strategy based on latent semantics and the retrieval method based on the genetic algorithm considerably improve precision, recall, and F-measure, and that the proposed algorithm is also more efficient than traditional text information retrieval methods. The proposed strategy and method are therefore applicable to large-scale text information retrieval.
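
The search in item (3) can be sketched as a small genetic algorithm. The record gives no details of the thesis's encoding or operators, so everything below is an assumption for illustration: a candidate paragraph is modelled as a window `(start, length)` over sentence vectors already projected into the latent space, fitness is the cosine similarity between the summed window vector and the query, and the operators are tournament selection, a one-gene crossover, bounded mutation, and elitism. All names and parameter values are hypothetical.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def fitness(ind, sentences, query):
    """Similarity between the query and the summed vectors of one window."""
    start, length = ind
    window = sentences[start:start + length]
    summed = [sum(col) for col in zip(*window)]
    return cosine(summed, query)

def repair(ind, n, max_len):
    """Clamp a (start, length) pair so the window stays inside the text."""
    start, length = ind
    start = min(max(start, 0), n - 1)
    length = min(max(length, 1), max_len, n - start)
    return (start, length)

def ga_search(sentences, query, max_len=3, pop_size=20,
              generations=40, mutation_rate=0.3, seed=0):
    """Return (best_window, best_score) found by a simple elitist GA."""
    rng = random.Random(seed)
    n = len(sentences)

    def score(ind):
        return fitness(ind, sentences, query)

    pop = [repair((rng.randrange(n), rng.randint(1, max_len)), n, max_len)
           for _ in range(pop_size)]
    best = max(pop, key=score)
    for _ in range(generations):
        nxt = [best]                                   # elitism: keep the best
        while len(nxt) < pop_size:
            p1 = max(rng.sample(pop, 2), key=score)    # tournament selection
            p2 = max(rng.sample(pop, 2), key=score)
            child = (p1[0], p2[1])                     # crossover: start + length
            if rng.random() < mutation_rate:           # mutation: small shifts
                child = (child[0] + rng.randint(-2, 2),
                         child[1] + rng.randint(-1, 1))
            nxt.append(repair(child, n, max_len))
        pop = nxt
        best = max(pop, key=score)
    return best, score(best)

# Hypothetical sentence vectors, already expressed in a 2-D latent space.
sentences = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0),
             (0.1, 0.9), (1.0, 0.1), (0.0, 1.0)]
query = (0.0, 1.0)
best, best_score = ga_search(sentences, query)
print(best, round(best_score, 3))
```

On a corpus this small an exhaustive scan would be just as fast; the GA only pays off when the number of candidate windows is too large to enumerate, which is the large-text setting the abstract targets.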

  • 【Online publication contributor】 Yanbian University
  • 【Online publication year and issue】 2011, Issue S1