
Research on Form-Based Deep Search Techniques (基于表单的深度搜索技术研究)

Research on Form-Based Hidden Web

【Author】 Xu Rong (徐荣)

【Advisor】 Jiang Zongli (蒋宗礼)

【Author information】 Beijing University of Technology, Computer Software and Theory, 2008, Master's

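【Summary】 Most current search engines index only static pages reachable through hyperlinks, yet much important data is stored in back-end web databases and can be obtained only by querying through forms; the corresponding pages are called hidden web pages. To help users obtain more information, this thesis discusses methods for searching hidden pages, presents a system architecture, and discusses its key technologies. The thesis first analyzes the strengths and weaknesses of the Internet search engines in common use today, compares general search with deep search, and proposes a crawling strategy suited to deep search, namely focused crawling guided by link classification and text classification. It also sets stopping criteria for searching within a single site and applies path learning to regularly structured sites, so as to find as many form-bearing pages as possible. By simulating how a user accesses deep web pages, the thesis carries out the following work. First, based on a survey, it proposes a crawling strategy that downloads form-bearing pages quickly and effectively; it then processes the pages, extracts the form information, and converts it into a representation that a program can interpret, that is, it models the form. Second, it uses heuristic rules and form classification to select the useful forms. Third, it extracts form labels and semantic terms, fills in and submits the forms automatically, and retrieves the desired pages. The thesis makes full use of the structure and text of forms; its classifier compares classification by labels with classification by the useful text surrounding the form, and is trained with the Centroid, KNN, and SVM algorithms. Experiments show that classification on the text surrounding the form works well and that SVM performs best. Finally, the Name value table used for automatic form filling is discussed. Experiments verify the effectiveness of form classification and form information extraction.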
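
The form-modeling and automatic-filling steps summarized above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the `FormModel` structure, the `looks_searchable` heuristic (reject forms with a password field, require at least one text-like control), and the layout of the name-value table as a plain dictionary are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from urllib.parse import urlencode


@dataclass
class FormModel:
    """Program-readable model of one HTML form (hypothetical structure)."""
    action: str = ""
    method: str = "get"
    fields: dict = field(default_factory=dict)  # control name -> control kind


class FormExtractor(HTMLParser):
    """Collects every <form> on a page together with its named controls."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form":
            method = (a.get("method") or "get").lower()
            self._current = FormModel(a.get("action") or "", method)
        elif self._current is not None and tag in ("input", "select", "textarea"):
            name = a.get("name")
            if name:  # unnamed controls (e.g. a bare submit button) are skipped
                kind = (a.get("type") or "text").lower() if tag == "input" else tag
                self._current.fields[name] = kind

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def looks_searchable(form):
    """Assumed heuristic: no password field, at least one text-like control."""
    kinds = set(form.fields.values())
    return "password" not in kinds and bool(kinds & {"text", "search", "select", "textarea"})


def fill(form, name_value_table):
    """Build a GET URL from the table; controls without an entry are left out."""
    values = {n: name_value_table[n] for n in form.fields if n in name_value_table}
    return form.action + "?" + urlencode(values)
```

A crawler would feed each downloaded page to `FormExtractor`, keep the forms that pass `looks_searchable`, and submit the URL returned by `fill`. A real system would also need to handle POST forms, default values, and option lists, which this sketch omits.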

【Abstract】 Most search engines retrieve only the publicly indexable web (PIW), which is reached through hyperlinks. As the web develops, however, more and more information is stored in back-end web databases. These data can be retrieved only through HTML forms; the resulting pages are called hidden web pages. To help people obtain the important data in web databases, we have a system that can search hidden web pages. In this paper, its architecture is presented and the key technologies are discussed. First, the advantages and disadvantages of common search engines are analyzed, and common search engines are compared with hidden web search engines. A strategy suited to hidden web crawling is given, which uses a link classifier and a text classifier to achieve focused crawling. In addition, based on the specific characteristics of forms, new stopping criteria are introduced that are very effective in guiding the crawler away from excessive speculative work within a single site. In this paper, the process of a user accessing the hidden web is simulated. First, forms are converted into a representation that a program can interpret, that is, the forms are modeled. Second, the useful forms are extracted using heuristic rules and a form classifier. Finally, form labels and the context of each form are extracted, and the results are used to fill in the forms automatically to find hidden web pages. We make full use of the structure and text information of forms. The classifier combines label classification with classification of the text surrounding the form. We use the Centroid, KNN, and SVM algorithms. The experiments show that the SVM algorithm performs best. The experiments also verify the effectiveness of form classification and form extraction.
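
As a rough illustration of the form classification step, the following Python sketch implements a simple centroid classifier over the text surrounding a form. It is an assumption-laden example rather than the thesis's classifier: the whitespace tokenizer, raw term-frequency weighting, the labels `searchable` and `other`, and the tiny training set are all hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Term-frequency vector of lowercase, whitespace-separated tokens."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def train_centroids(samples):
    """samples: iterable of (text, label) pairs.

    Returns label -> summed term vector. Cosine similarity ignores vector
    length, so the sum points in the same direction as the mean (centroid).
    """
    centroids = {}
    for text, label in samples:
        centroids.setdefault(label, Counter()).update(vectorize(text))
    return centroids


def classify(text, centroids):
    """Assign the label whose centroid is most similar to the text."""
    vec = vectorize(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical training data: text found around forms, with hand-made labels.
SAMPLES = [
    ("search books by title author keyword", "searchable"),
    ("find flights search destination date", "searchable"),
    ("login username password sign in", "other"),
    ("subscribe newsletter email address", "other"),
]
```

Calling `classify("search by author", train_centroids(SAMPLES))` picks the label with the closest centroid. The KNN and SVM variants named in the abstract would replace `classify` with a nearest-neighbour vote or a trained margin classifier over the same vectors.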

  • 【Classification No.】TP391.3
  • 【Downloads】194