

The Intelligent Search Technology Based on Latent Semantic Analysis

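【Author】 Wang Yang (王洋)

【Supervisor】 Yin Guisheng (印桂生)

【Author information】 Harbin Engineering University, Computer Software and Theory, 2010, Master's thesis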

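【Abstract (translated from Chinese)】 In recent years the Internet has developed rapidly into a vast, dynamic information service network that contains many kinds of information resources and has sites all over the world, giving users an extremely valuable source of information. Search engines provide a friendly retrieval interface, help people extract useful information from enormous volumes of data, and greatly reduce the time users spend on queries. The vast majority of information on the Internet is stored as text, and the exponential growth of that text poses a great challenge to search engine technology: it is becoming harder and harder to find relevant information on the web quickly and accurately. Because natural language contains uncertainties such as synonymy (many words with one meaning) and polysemy (one word with many meanings), the same concept can be expressed in many different ways. In traditional search engines based on keyword string matching, only the surface form of the words takes part in matching, not the full concepts they express, so users find it hard to express what they really want to query with simple keywords or keyword strings. Raising search engine technology from the level of keyword matching to the level of semantics, so that user queries are recognized and processed intelligently in terms of their meaning, has become a current research focus. Starting from the perspective of intelligent search modeling and drawing on latent semantic analysis, this thesis studies document processing, query processing, and the final information matching step in a search engine. On this basis, the weights in the latent semantic space are analyzed and improved from a probabilistic point of view, so that they better reflect the semantic relations among documents and between documents and terms. User queries are semantically expanded to compensate for insufficient user input or for mismatches between the query and the index vocabulary. For cases where the search results are unsatisfactory, a secondary search strategy is proposed that adjusts the results to bring them closer to the user's requirements. Finally, an intelligent search system based on latent semantic analysis is designed and implemented; it verifies that the algorithms can, to some extent, improve the search engine's understanding of semantics and achieve relatively high accuracy and precision.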

【Abstract】 In recent years the Internet has grown rapidly into a vast, dynamic information service network that holds all kinds of information from around the world and provides users with a valuable source of information. Search engines offer user-friendly search interfaces that help people extract useful information from huge volumes of data and save a great deal of query time. The vast majority of information on the Internet is stored as text, and the exponential growth of text information has brought great challenges to search engine technology. Because of synonymy, polysemy, and other uncertainties in natural language, the same concept can be expressed in many different ways. Traditional search engines based on keyword matching compare only keywords or keyword strings, not the concepts that users actually want to express. Search engines therefore need to move from keyword matching to the semantic level, and recognizing and handling user queries intelligently has become a research focus in search engine technology. From the point of view of intelligent search modeling, this thesis studies document processing, query processing, and the final information matching step in search engines, combined with the latent semantic analysis technique. On this basis, the word weights in the latent semantic space are analyzed and improved in a probabilistic sense, so that they better reflect the semantic relations between words and documents. Next, user queries are expanded to compensate for insufficient information supplied by the user or for mismatches between the user's words and the index vocabulary. In addition, a secondary search strategy is proposed that brings the results closer to the user's requirements when the user is not satisfied with the first results.
In the end, an intelligent search system based on latent semantic analysis is designed and implemented, which can perceive users' intentions to some extent and achieves higher accuracy and precision.
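
The document processing, query processing, and matching steps named in the abstract can be illustrated with a minimal sketch of textbook latent semantic analysis. This is not the thesis's implementation: the raw term counts, the truncated SVD, the query fold-in, and the cosine ranking are standard LSA, and the function names (`build_term_doc_matrix`, `lsa_fit`, `fold_in_query`, `cosine_scores`) are hypothetical. The probabilistic weighting, query expansion, and secondary search described above are not reproduced here.

```python
import numpy as np


def build_term_doc_matrix(docs):
    """Return a raw-count term-document matrix and a term -> row index map.

    Assumption: whitespace tokenization and raw counts; the thesis uses
    its own (probabilistic) weighting, which is not modelled here.
    """
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            A[index[w], j] += 1.0
    return A, index


def lsa_fit(A, k):
    """Truncated SVD A ~ U_k S_k V_k^T; rows of V_k are document vectors."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :].T


def fold_in_query(query, index, U_k, s_k):
    """Project a query into the latent space: q_hat = q^T U_k S_k^{-1}.

    Words that are not in the index vocabulary are ignored.
    """
    q = np.zeros(U_k.shape[0])
    for w in query.split():
        if w in index:
            q[index[w]] += 1.0
    return (q @ U_k) / s_k


def cosine_scores(query_vec, doc_vecs):
    """Cosine similarity between the query and every document vector."""
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    norms = np.where(norms == 0.0, 1.0, norms)  # avoid division by zero
    return (doc_vecs @ query_vec) / norms


if __name__ == "__main__":
    # Toy corpus (illustrative data, not from the thesis).
    docs = [
        "car engine repair",
        "automobile engine repair",
        "automobile dealer price",
        "fruit apple banana",
        "banana fruit juice",
    ]
    A, index = build_term_doc_matrix(docs)
    U_k, s_k, doc_vecs = lsa_fit(A, k=2)
    scores = cosine_scores(fold_in_query("car", index, U_k, s_k), doc_vecs)
    for j in np.argsort(-scores):
        print(f"{scores[j]:+.3f}  {docs[j]}")
```

In this toy corpus the query "car" shares no word with the second document, yet both fall along the same latent dimension because "car" and "automobile" co-occur with "engine" and "repair". This is the kind of synonym handling that keyword matching misses and that motivates the semantic approach described in the abstract.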
