

A Study on the Technique of Information Acquirement Based on Meta-Search and Content Clustering

【Author】 翁勍力

【Advisor】 赵捧未

【Author Information】 Xidian University, Information Science, 2007, Master's

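【Abstract (Chinese)】 Web information has become a primary source of intelligence, and search engines are one of the main ways to obtain it. However, information obtained through search engines still has many problems: for example, the volume retrieved is large but little of it is useful, and the retrieved information is diverse but users cannot identify groups of related information. Acquiring useful information resources has gradually become a bottleneck in the development of the information industry. How to remove useless information from massive data, quickly locate groups of related information, acquire information resources quickly and efficiently, and then process, organize, and deliver them to users is therefore a major challenge for information professionals and a problem in urgent need of a solution. With improving the efficiency and quality of information acquisition as its main goal, this thesis studies and implements an information acquisition system based on meta-search and content clustering. Main contributions: (1) It designs the overall framework of the information acquisition system, proposes three functional modules (search module, computation module, and user module), and describes the functional flow of each module. (2) It proposes a method, based on analysis of web page titles and summaries, for judging the relevance of meta-search engine results. Experimental results show that the average precision of the meta-search engine's results is considerably higher than that of each member engine. (3) Combining two major clustering algorithms, K-means partitioning and BIRCH, it proposes a method for clustering on top of the processed meta-search results. Experiments show that the method clearly improves clustering quality and greatly improves efficiency. (4) For the design and implementation of the system, it proposes designs for the database system, software system, and human-machine interface, and implements information retrieval based on title and summary analysis, cluster analysis based on meta-search results and the combined K-means and BIRCH algorithm, and multidimensional analysis based on OLAM.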

【Abstract】 The Web has become the main resource for acquiring information, and the search engine is the main tool for doing so. However, the information acquired is still unsatisfactory: users cannot distinguish useful information within enormous, unstructured search results. Users want to obtain good information efficiently, overcome information overload, and harness the true power of information. In this thesis, to improve the efficiency and quality of information acquisition, we study and develop an information acquisition system based on meta-search and content clustering techniques, with the following contributions: (1) We propose an overall framework with three modules (search module, operation module, and user module) and describe their workflows. (2) We propose a method that analyzes the title and abstract of each web page to judge the relevance of search results. Experiments show an improvement in average precision compared with the member search engines. (3) We put forward a clustering method based on meta-search results and two clustering algorithms, K-means and BIRCH. Experimental evaluation shows improvements in both clustering quality and efficiency. (4) For system design and implementation, we present the database system, software system, and human-machine interface. The system provides three functions: information retrieval based on title and abstract analysis, clustering based on meta-search results and the two clustering algorithms K-means and BIRCH, and multidimensional analysis based on OLAM.
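
The abstract does not spell out how titles and abstracts are scored, so the following is only a minimal sketch of the general idea behind contribution (2): merge result lists from several member engines and rank each hit by how many query terms appear in its title and snippet. Everything here is an assumption made for illustration and not the thesis's actual method: the `Result` record, the `relevance` and `merge_results` helpers, the 2:1 title-to-snippet weighting, and the 0.3 cut-off.

```python
import re
from collections import namedtuple

# Hypothetical record for one hit returned by a member search engine.
Result = namedtuple("Result", ["engine", "url", "title", "snippet"])

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def relevance(query, result, title_weight=2.0, snippet_weight=1.0):
    """Score in [0, 1]: weighted share of query terms found in title and snippet.

    The weights are illustrative assumptions, not values from the thesis.
    """
    query_terms = set(tokenize(query))
    if not query_terms:
        return 0.0
    title_hit = len(query_terms & set(tokenize(result.title))) / len(query_terms)
    snippet_hit = len(query_terms & set(tokenize(result.snippet))) / len(query_terms)
    total_weight = title_weight + snippet_weight
    return (title_weight * title_hit + snippet_weight * snippet_hit) / total_weight

def merge_results(query, result_lists, threshold=0.3):
    """Merge member-engine lists, keep the best-scoring copy of each URL,
    drop hits below the threshold, and sort by descending score."""
    best = {}
    for results in result_lists:
        for r in results:
            score = relevance(query, r)
            if score >= threshold and score > best.get(r.url, (-1.0, None))[0]:
                best[r.url] = (score, r)
    return sorted(best.values(), key=lambda pair: pair[0], reverse=True)
```

Calling `merge_results(query, [engine_a_hits, engine_b_hits])` returns `(score, Result)` pairs. A clustering stage such as the K-means and BIRCH combination named in contribution (3) would then run on the retained hits.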

  • 【Classification Code】G354
  • 【Citation Count】3
  • 【Download Count】287