

【Author】 Wu Donghua (吴东华)

【Advisor】 Sun Huaijiang (孙怀江)

【Author Information】 Nanjing University of Science and Technology, Computer Application Technology, 2004, Master's thesis

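【Abstract (translated from the Chinese)】 With the rise of the Internet and the arrival of the information age, Web information retrieval has become a major research focus worldwide. Obtaining the information people are interested in as accurately as possible is the central concern of this research. However, because of the diversity of the Internet and the complexity of document structure, Web information retrieval is difficult to study and can hardly cover every domain, so specialized search engines have become the main way to address the problem. This thesis studies Citeseer, widely regarded as the best search engine for computer science literature, and attempts to propose a scheme that lets researchers obtain computer science papers from the Citeseer site more conveniently and accurately according to their own interests. The work of this thesis includes:

1. Collection and analysis of papers from the Citeseer site. Processing information on the Internet usually requires downloading Web pages scattered across the network to a local machine for further processing. This thesis therefore designs a web crawler that exploits the specific form of the links pointing to paper pages on Citeseer and downloads the HTML source of those pages into a local database. The pages are then analyzed according to the regularities of their display layout, and the required items are extracted and stored by category in separate database tables for later study.

2. Paper quality evaluation based on content and on topological structure. Starting from the result set returned by a Citeseer search, this thesis re-evaluates the papers by content and by topological structure separately and re-ranks the result set accordingly, in order to find the papers of interest. The content-based evaluation builds a "context graph" from good papers supplied in advance to find samples of each class, and uses naive Bayes as the classification algorithm; the evaluation based on topological structure uses the PageRank algorithm. Experimental results show that the two methods reflect paper quality from a subjective and an objective perspective respectively.

3. A knowledge decision system framework combining content and topological structure. Because the content-based and structure-based methods evaluate paper quality from subjective and objective perspectives respectively, this thesis combines them into a knowledge decision system framework for the Citeseer literature search engine. Concretely, the content-based method first extracts the relevant papers from the Citeseer result set, and PageRank then ranks these papers objectively. Experiments in two fields familiar to the author show that the approach is effective to some degree.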
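The collection step in item 1 relies on paper-page links having a specific form, which the abstract does not spell out. The following is a minimal sketch of that idea only, not the thesis's crawler: it extracts links from an HTML string and keeps those matching a pattern. The pattern `PAPER_LINK`, the helper `paper_links`, and the sample HTML are all hypothetical, chosen for illustration; no network access or database is involved.

```python
import re
from html.parser import HTMLParser

# Hypothetical link form: "/<letters><two digits><letters>.html".
# The actual form used on Citeseer is not stated in the abstract.
PAPER_LINK = re.compile(r"^/[a-z]+\d{2}[a-z]*\.html$")


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def paper_links(html):
    """Return links matching the assumed paper-page pattern, deduplicated in page order."""
    collector = LinkCollector()
    collector.feed(html)
    seen = []
    for link in collector.links:
        if PAPER_LINK.match(link) and link not in seen:
            seen.append(link)
    return seen


# Made-up page fragment: one paper link (twice) and one navigation link.
sample = (
    '<a href="/brin98anatomy.html">A</a> '
    '<a href="/help.html">Help</a> '
    '<a href="/brin98anatomy.html">dup</a>'
)
print(paper_links(sample))  # ['/brin98anatomy.html']
```

A real crawler would additionally fetch each matching page and store its source, which is left out here because the abstract gives no details of the storage schema.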

【Abstract】 With the rise of the World Wide Web and the arrival of the information-explosion age, techniques for acquiring Web information have become a very active research subject. How to obtain the information of interest accurately is the most important problem. However, because of the complexity of the Web, the research is hard and can scarcely cover every area, so topic-specific search engines have emerged as one of the best solutions. In this thesis we study Citeseer, which is believed to be the best topic-specific search engine for computer science literature, and try to put forward a scheme that helps scientists acquire the computer science papers they are interested in from Citeseer more conveniently and more accurately. The contributions of this thesis are:

1. Collecting and analyzing papers on Citeseer. When processing information on the Web, we need to download HTML pages to a local machine. We design a web crawler for Citeseer that collects the HTML source of every paper page and stores it in a local database, then parses the pages according to Citeseer's display rules and stores the extracted information in the corresponding tables. This work prepares the data for the research that follows.

2. Quality evaluation of papers based on content and link structure. Starting from the result set returned by a Citeseer search, we re-evaluate the papers using content information and link structure separately and re-sort them so that interesting papers are found more accurately. The content-based method uses a "context focused graph" to find sample texts and a naive Bayes classifier; the link-structure method uses the PageRank algorithm. Experimental results show that the two methods evaluate papers appropriately from two different sides.

3. A knowledge decision framework based on content and link structure. Since the content-based method evaluates papers from a subjective point of view, while the link-structure method evaluates them from an objective point of view, we put forward a scheme that combines the two. Concretely, we first find the relevant papers based on content, which shrinks the result set returned by Citeseer, then evaluate these papers based on link structure and present the results in order of evaluation value. Experimental results show that this method has a certain effect.
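The two-stage framework in item 3 (content filter first, link-based ordering second) can be sketched as follows. This is an illustration under stated assumptions, not the thesis's implementation: the training texts are hand-labelled here rather than found through a context graph, the naive Bayes variant is multinomial with Laplace smoothing, PageRank is computed on the citation graph restricted to the papers that pass the filter, and all paper ids, texts, and citations are made up.

```python
import math
from collections import Counter


def train_nb(docs):
    """docs: list of (label, text). Returns per-label word counts, doc counts, vocabulary."""
    word_counts, doc_counts, vocab = {}, Counter(), set()
    for label, text in docs:
        words = text.lower().split()
        word_counts.setdefault(label, Counter()).update(words)
        doc_counts[label] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab


def classify(model, text):
    """Return the label with the highest log posterior (Laplace smoothing)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def pagerank(graph, damping=0.85, iterations=50):
    """graph: {paper: [papers it cites]}. Power iteration; dangling mass is spread evenly."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[node] for node in nodes if not graph[node])
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for node in nodes:
            for cited in graph[node]:
                new_rank[cited] += damping * rank[node] / len(graph[node])
        rank = new_rank
    return rank


def filter_then_rank(model, papers, citations):
    """Keep papers classified 'relevant', then order them by PageRank on the induced graph."""
    relevant = [pid for pid, text in papers.items() if classify(model, text) == "relevant"]
    keep = set(relevant)
    subgraph = {pid: [c for c in citations.get(pid, []) if c in keep] for pid in relevant}
    rank = pagerank(subgraph)
    return sorted(relevant, key=lambda pid: rank[pid], reverse=True)


# Toy data, invented for illustration only.
training = [
    ("relevant", "web crawler search engine ranking"),
    ("relevant", "link analysis search ranking"),
    ("other", "protein folding simulation"),
    ("other", "compiler register allocation"),
]
papers = {
    "p1": "focused web crawler",
    "p2": "search engine ranking",
    "p3": "protein simulation",
    "p4": "link analysis",
}
citations = {"p1": ["p2"], "p2": [], "p3": ["p2"], "p4": ["p2", "p1"]}

model = train_nb(training)
print(filter_then_rank(model, papers, citations))  # ['p2', 'p1', 'p4']
```

In this toy run the off-topic paper p3 is dropped by the content filter, and the most-cited remaining paper, p2, is ranked first. Whether the thesis computes PageRank on the filtered subgraph or on the full citation graph is not stated in the abstract; the subgraph choice here is an assumption.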

  • 【Classification Number】TP393.092
  • 【Times Cited】12
  • 【Downloads】496