Research on Chemical Information Acquisition Method Based on Web Data Mining

【Author】 Feng Shuo (冯硕)

【Advisor】 Li Shuqin (李书琴)

【Author information】 Northwest A&F University, Computer Application Technology, 2012, Master's thesis

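【Summary】 As the Internet grows, online information resources increase daily, and conventional ways of gathering information suffer from low accuracy and low efficiency. Taking commonly used chemical substance websites as its object, this thesis studies techniques and methods for obtaining information from web pages quickly and efficiently, so that a chemical substance environmental safety database can be updated automatically. First, vertical search engine technology is used to filter and fetch relevant chemical substance pages and to analyze their structure, and a suitable technique is applied according to how structured each page is. Second, a sorting algorithm, a global schema, and related methods are used to integrate the heterogeneous data found on chemical substance websites. In addition, to support continuous and timely extraction from dynamic information sources, the thesis proposes task splitting, a failure-retry mechanism, and dynamic update checking. The main research contents and conclusions are as follows:

(1) Dynamic acquisition of online chemical substance information. The main task is to obtain the CasNo (the chemical substance registry number), the name, the physicochemical properties, and related information. Depending on the page type, pages are fetched with either focused crawler technology or simulated manual browsing. The tree structure of each page is analyzed; wrapper technology extracts the attribute information of each chemical substance, and regular expressions extract structured information from unstructured data. Listener technology schedules the tasks for each chemical substance website, which ensures that online information is acquired automatically and that the data are updated in a timely manner.

(2) Integration of heterogeneous chemical substance data. To address data heterogeneity across chemical substance web pages, the thesis first determines the integration scope from the attributes relevant to environmental safety and designs a common data model, CompoundsDTO, as the global schema. It then applies a sorting algorithm to analyze the dynamically acquired data, and finally maps the processed data onto the global schema. This integrates the heterogeneous data and effectively eliminates structural and semantic conflicts among the heterogeneous data sources.

(3) Design and development of a chemical substance environmental safety data management system. Building on the chemical substance environmental safety database, and using the dynamic acquisition and heterogeneous data integration techniques described above, a data management system was designed and developed. It extracts chemical substance information from the Internet automatically and in a timely manner, and uses dynamic update detection to store data with a unified, standardized structure in the database, supporting database update and query.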
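The regular-expression step in item (1) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names are hypothetical, and the only facts relied on are the public CAS Registry Number layout (2 to 7 digits, 2 digits, 1 check digit) and its check-digit rule.

```python
import re
from typing import List

# CAS Registry Number layout: 2-7 digits, 2 digits, 1 check digit,
# separated by hyphens and not embedded in a longer digit run.
CAS_PATTERN = re.compile(r"(?<!\d)(\d{2,7})-(\d{2})-(\d)(?!\d)")


def is_valid_cas(cas: str) -> bool:
    """Return True if `cas` has the CAS layout and a correct check digit."""
    match = CAS_PATTERN.fullmatch(cas)
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost body digit;
    # the weighted sum modulo 10 must equal the check digit.
    total = sum(int(d) * w for w, d in enumerate(reversed(body), start=1))
    return total % 10 == check_digit


def extract_cas_numbers(text: str) -> List[str]:
    """Collect distinct, checksum-valid CAS numbers in order of appearance.

    Hypothetical helper: the thesis does not describe its extraction rules.
    """
    found: List[str] = []
    for match in CAS_PATTERN.finditer(text):
        cas = match.group(0)
        if is_valid_cas(cas) and cas not in found:
            found.append(cas)
    return found
```

Applied to free text such as "Water (CAS No. 7732-18-5)", `extract_cas_numbers` returns `["7732-18-5"]`. Validating the check digit filters out strings that merely look like CAS numbers, which matters when the input is unstructured page text.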

【Abstract】 With the rapid development of the Internet, online information resources are increasing day by day, and conventional means of obtaining information suffer from low accuracy and low efficiency. This thesis therefore takes commonly used chemical substance websites as its object of study and investigates how to obtain information from web pages quickly and efficiently, so that a chemical substance environmental safety database can be updated automatically. First, vertical search engine technology is used to retrieve pages related to chemical substances and to analyze their structure, and appropriate techniques are chosen according to how structured each page is. Second, methods such as a sorting algorithm and a global schema are used to integrate the heterogeneous data on chemical substance websites. In addition, task segmentation, a failure-retry mechanism, and dynamic update checking are introduced to support continuous and timely extraction from dynamic information sources. The main contents are as follows:

(1) Dynamic acquisition of online chemical substance information. The main task is to obtain the CasNo (chemical registry number), name, physicochemical properties, and other information. Depending on the type of page, either focused crawler technology or simulated manual browsing is used to fetch it. The tree structure of the page is analyzed, wrapper techniques extract the attributes of each chemical substance, and regular expressions extract structured information from unstructured data. Listener technology schedules the tasks for each chemical substance website, ensuring that online information is acquired automatically and that the data are updated in a timely manner.

(2) Integration of heterogeneous chemical substance data. To address data heterogeneity across chemical substance web pages, the thesis first determines the integration scope from the attributes related to environmental safety and designs a common data model, CompoundsDTO, as the global schema. Second, a sorting algorithm is used to analyze the dynamically acquired data. Finally, the processed data are mapped to the global schema. Together these steps integrate the heterogeneous data and effectively eliminate structural and semantic conflicts among the heterogeneous data sources.

(3) Design and development of a chemical substance environmental safety data management system. Building on the chemical substance environmental safety database, the dynamic acquisition and heterogeneous data integration techniques are applied to design and develop a data management system. The system extracts chemical information from the Internet automatically and in a timely manner, and uses dynamic update detection to store data with a unified structure in the database, supporting database update and query.
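The mapping step in item (2) can be sketched as follows. Only the name CompoundsDTO comes from the thesis. Its fields, the two source names, their field names, and the Fahrenheit example are all assumptions made for illustration. Expressing it in Python is also an assumption, since the thesis does not state its implementation language here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class CompoundsDTO:
    """Global schema. The class name is from the thesis; the fields are assumed."""
    cas_no: str
    name: str
    melting_point_c: Optional[float] = None


def fahrenheit_to_celsius(value: str) -> float:
    """Resolve a semantic conflict: same property, different unit."""
    return round((float(value) - 32.0) * 5.0 / 9.0, 2)


# Hypothetical per-source mappings: source field -> (global field, converter).
# Renaming fields resolves structural conflicts; converters resolve semantic ones.
SOURCE_MAPPINGS: Dict[str, Dict[str, Tuple[str, Callable[[str], object]]]] = {
    "site_a": {
        "CAS": ("cas_no", str.strip),
        "Name": ("name", str.strip),
        "mp_C": ("melting_point_c", float),
    },
    "site_b": {
        "casno": ("cas_no", str.strip),
        "title": ("name", str.strip),
        "melting_point_F": ("melting_point_c", fahrenheit_to_celsius),
    },
}


def to_global(source: str, record: Dict[str, str]) -> CompoundsDTO:
    """Map one source record onto the global schema, ignoring unmapped fields.

    Raises TypeError if the record lacks a required field (cas_no or name).
    """
    mapping = SOURCE_MAPPINGS[source]
    values: Dict[str, object] = {}
    for field_name, raw in record.items():
        if field_name in mapping:
            target, convert = mapping[field_name]
            values[target] = convert(raw)
    return CompoundsDTO(**values)
```

With this layout, supporting another website would mean adding one mapping entry rather than changing the schema, which is one common way to keep a global schema stable across heterogeneous sources.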
