Autonomic Topic Page Collection Based on RSS and Ontology Semantic Adaptation

RSS and Ontology Semantic Based Autonomic Web Page Collection in Vertical Search Engine

【Author】 Zhang Haobin (张浩斌)

【Advisor】 Hu Hua (胡华)

【Author Information】 Zhejiang Gongshang University, Computer Applications, 2008, Master's

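【Summary】 Search engines emerged in response to the growth of information on the Internet. Their task is to help users sift through massive amounts of information and quickly find what they need. Surveys indicate that in 2006 search had become the second most used Internet service, behind only e-mail. General-purpose search engines handle searches over massive amounts of information, but they struggle to also deliver accurate and relevant results, and they rarely satisfy personalized or specialized queries that demand precision. Vertical search refers to specialized search engines aimed at a particular industry; it is a subdivision and extension of general search. A vertical search engine consolidates one specific category of information from the page repository, extracts the required data field by field, processes it, and returns it to the user in some form. Because a vertical search engine is a retrieval tool for a specific domain and topic, topic-oriented page collection is its foundation. This thesis analyzes and studies that core, foundational task, topic page collection, focusing on the following:
1. Building on DOM parsing, it proposes an improved HPath page extraction technique. To address heterogeneity among DOM parsers, it uses HPath to solve the problem of integrating different parsers, laying a theoretical and technical foundation for commercial topic page collection and vertical search engine research.
2. For the emerging Web 2.0 environment, it proposes a high-precision topic page collection scheme based on Web 2.0 and uses XPath to resolve inconsistencies among RSS standards.
3. For post-processing of collected topic pages, it proposes ontology semantic adaptation to resolve the topic semantic heterogeneity of pages from different systems, and uses a semantic distance algorithm to summarize and classify page topics.
4. To improve the practicality and maintainability of the collection system, it attempts to adopt IBM's autonomic computing framework, combined with improved ECA rules from active data warehousing, and proposes a design for a topic page collection system with a degree of autonomic capability.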

【Abstract】 Search engines are important tools for locating online information quickly. Users obtain relevant information through keyword or full-text search. While general-purpose engines return massive amounts of information for a user query, they have trouble maintaining comprehensive and up-to-date search indexes. They fail to deliver highly accurate and relevant results and cannot satisfy personalized and professional queries. Vertical search can be regarded as an extension and customization of general search. Such engines focus on a certain domain: they identify and integrate domain-specific information, extract the needed data, and wrap it into formatted information. Within this process, topic-oriented web page collection is the key and basic part. Based on an analysis of vertical search, the author has researched and implemented web page collection. The main research work presented in this thesis is as follows:
1. It proposes the HPath web extraction method on the basis of DOM parsing, to address heterogeneity among DOM parsers. This provides a basis, in both theory and practice, for commercial topic-oriented web page collection and vertical search engines.
2. It puts forward a scheme for high-precision topic web page collection based on Web 2.0 techniques, and solves the multi-standard problem in RSS.
3. An ontology semantic adaptation solution is presented to cope with the heterogeneous semantics of web pages from various systems, and a semantic distance function is defined for web page summarization and classification.
4. The ECA rule system is modified to fit IBM's autonomic computing framework, and an autonomic web page collection system is designed with applicability and maintainability as its goals.
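
Item 2 refers to handling the multi-standard problem in RSS, and the Chinese abstract names XPath as the means. The record does not show the thesis's actual expressions, so the following is only a minimal sketch of the general idea: try one path rule per feed dialect and normalize the result. The rule table `FEED_RULES`, the function `extract_items`, and the choice of Python's standard `xml.etree.ElementTree` (which supports a subset of XPath) are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used in the path rules below (prefix names are arbitrary).
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "rss1": "http://purl.org/rss/1.0/",
}

# Hypothetical rule table: (item path, title path, link path, link attribute).
# A link attribute of None means the URL is the element text.
FEED_RULES = [
    ("./channel/item", "title", "link", None),            # RSS 0.9x / 2.0
    ("./rss1:item", "rss1:title", "rss1:link", None),     # RSS 1.0 (RDF)
    ("./atom:entry", "atom:title", "atom:link", "href"),  # Atom 1.0
]


def extract_items(xml_text):
    """Return [{'title': ..., 'link': ...}] using the first rule that matches."""
    root = ET.fromstring(xml_text)
    for item_path, title_path, link_path, link_attr in FEED_RULES:
        items = root.findall(item_path, NS)
        if not items:
            continue  # this dialect does not match; try the next rule
        results = []
        for item in items:
            title = item.findtext(title_path, default="", namespaces=NS)
            link_el = item.find(link_path, NS)
            if link_el is None:
                link = ""
            elif link_attr:
                link = link_el.get(link_attr, "")
            else:
                link = link_el.text or ""
            results.append({"title": title.strip(), "link": link.strip()})
        return results
    return []  # not a recognized feed dialect
```

With this layout, supporting another feed variant means adding one row to the rule table, while the collector downstream sees the same title and link fields regardless of the source format.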

  • 【Classification Number】 TP393.092
  • 【Downloads】 135