Deep Web动态搜索的研究

Research on Deep Web Dynamic Search

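【Author】 Li Haibin (李海滨)

【Advisor】 Xu Nanshan (许南山)

【Author information】 Beijing University of Chemical Technology, Computer Application Technology, 2011, Master's

【Subtitle】 Dynamic Search Based on Book Websites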

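【Summary】 Addressing the characteristics of book websites, and relying on the observation that the text in front of a form item indicates what the item expects as input, this thesis designs a method that fills in forms dynamically by parsing their items. The dynamically parsed form is used to obtain result pages, which are then parsed, ranked by weight, and presented in a uniform display format. The thesis designs and implements a system that searches multiple websites of the same type through each site's own advanced search page, so that users can search several book websites at once quickly and conveniently. Experimental results confirm the correctness of the algorithm design. The main research work of this project includes: 1. Design of a dynamic form search algorithm based on dictionary matching. The algorithm parses forms in SAX style, avoiding the large amount of useless information produced by the DOM parsing used in earlier work; it parses the pages that contain query interfaces with multiple threads to improve processing performance; and it matches form-item keywords against a dictionary. A server-side program crawls pages and performs semantic analysis to discover new book websites and extend the keyword dictionary. 2. Result page parsing, built on the results obtained by dynamic form filling. The HTML tag structure of each book website's search result page is studied in advance and abstracted into an extraction template; parsing with this template yields a linked list of book information objects, which completes result parsing. 3. Post-processing of query results. The result items parsed from the result pages are ranked, mainly by how often similar books appear across the different websites and by their rank position on each website. The two factors are equally important, since both reflect how popular a book is and how well it sells, so an equal-weight ranking method is adopted. On the basis of this work, a dynamic form search system based on the advanced search pages of book websites is designed and implemented. The system offers a relatively novel approach: for websites of the same type, it matches query items precisely through their advanced search pages.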
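
The form-filling idea in item 1 can be illustrated with a minimal sketch. It is not the thesis's implementation: it uses Python's event-driven `html.parser` as a stand-in for a SAX parser, and the keyword dictionary `FIELD_KEYWORDS`, the helper names, and the sample form are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical keyword dictionary: canonical query field -> label keywords.
FIELD_KEYWORDS = {
    "title": ["书名", "title"],
    "author": ["作者", "author"],
    "publisher": ["出版社", "publisher"],
    "isbn": ["isbn"],
}


class FormLabelParser(HTMLParser):
    """Event-driven pass that pairs each text input with the text before it."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.last_text = ""
        self.fields = {}  # input name -> label text seen just before it

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.in_form = True
        elif tag == "input" and self.in_form:
            # Only text inputs with a name are candidates for filling.
            if attrs.get("type", "text") == "text" and attrs.get("name"):
                self.fields[attrs["name"]] = self.last_text
                self.last_text = ""

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False

    def handle_data(self, data):
        text = data.strip()
        if text and self.in_form:
            self.last_text = text


def match_fields(html):
    """Map canonical fields to the site's input names via dictionary matching."""
    parser = FormLabelParser()
    parser.feed(html)
    parser.close()
    mapping = {}
    for name, label in parser.fields.items():
        lowered = label.lower()
        for field, keywords in FIELD_KEYWORDS.items():
            if any(keyword in lowered for keyword in keywords):
                mapping[field] = name
                break
    return mapping


def fill_form(html, query):
    """Build the request parameters for one site from a canonical query."""
    mapping = match_fields(html)
    return {mapping[f]: v for f, v in query.items() if f in mapping}


# Made-up advanced search form for illustration.
SAMPLE = """
<form action="/search">
  书名: <input type="text" name="bk_name">
  作者: <input type="text" name="bk_writer">
  <input type="submit" value="搜索">
</form>
"""
```

Calling `fill_form(SAMPLE, {"title": "Deep Web"})` produces parameters keyed by the site's own input name (`bk_name`), so the same canonical query can be sent to sites that name their inputs differently. Fields with no dictionary match are simply dropped.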

【Abstract】 Based on the features of book-selling websites, this thesis designs a method that parses and fills in form items dynamically, using the fact that the text preceding an input item indicates what should be entered in it. The dynamically filled form is used to obtain result pages, which are parsed, sorted by weight, and displayed in a uniform format. The thesis designs and implements a system that queries multiple websites of the same type through their advanced search pages, giving users a convenient and efficient way to search for books on several book websites at once. Experimental results demonstrate the correctness of the algorithm design. The main research topics include: 1. A dynamic form search algorithm based on dictionary matching is designed. The algorithm parses forms with SAX to avoid the large quantity of useless information produced by the DOM parsing used in earlier work; it uses multiple threads to parse the pages containing query interfaces, which improves processing performance; and it uses dictionaries to match the keywords of form items. On the server side, pages are crawled and analyzed semantically to find new book-selling websites and expand the keyword dictionary. 2. Based on the results of dynamic form filling, result page parsing is implemented. The HTML tag structure of the search result pages is studied in advance and abstracted into an extraction template, which is used to parse the pages into a linked list of book information objects. 3. Post-processing of query results. The items parsed from the result pages are sorted; the main factors considered are how frequently similar books appear on the different websites and their rank on each website. The two factors are equally important, since both reflect the popularity and sales of a book, so equal-weight ranking is used. On the basis of this work, a dynamic form search system based on the advanced search pages of book websites is designed and implemented. The system offers a relatively new idea: for websites of the same type, it matches query items precisely through their advanced search pages.
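
The equal-weight ranking in item 3 can be sketched as follows. The abstract gives only the two factors and their equal weights; the normalizations here (the share of sites listing a book, and the mean of `1 - index / list_length` over the sites that list it) are assumptions made for illustration, and `rank_books` is a hypothetical name.

```python
from collections import defaultdict


def rank_books(site_results, w_freq=0.5, w_rank=0.5):
    """Merge per-site result lists into one ranking.

    site_results maps a site name to its ordered list of book keys
    (best first). Each book is assumed to appear at most once per site.
    """
    n_sites = len(site_results)
    position_scores = defaultdict(list)
    for results in site_results.values():
        for index, book in enumerate(results):
            # Assumed normalization: first place scores 1.0, later places less.
            position_scores[book].append(1.0 - index / len(results))

    scores = {}
    for book, per_site in position_scores.items():
        freq = len(per_site) / n_sites      # share of sites listing the book
        rank = sum(per_site) / len(per_site)  # mean normalized position
        scores[book] = w_freq * freq + w_rank * rank

    # Highest score first; ties broken by key for a stable order.
    return sorted(scores, key=lambda b: (-scores[b], b))


# Made-up result lists from three sites.
EXAMPLE = {
    "site_a": ["A", "B", "C"],
    "site_b": ["B", "A"],
    "site_c": ["B", "D"],
}
```

With `EXAMPLE`, book `B` ranks first because all three sites list it and it is near the top of each list. The default weights of 0.5 reflect the abstract's statement that the two factors are equally important.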
