

Research on Web-based Spatial Data Grab and Evaluation

【Author】 王明军 (Wang Mingjun)

【Advisor】 杜清运 (Du Qingyun)

【Author information】 Wuhan University, Cartography and Geographic Information Engineering, 2013, PhD

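【Abstract (translated from the Chinese)】 The rapid development of Web technology has given people abundant information, but it has also brought a great deal of redundant information. How to quickly pinpoint what a user needs is one of the common problems in Web retrieval today. This is especially true in the field of spatial information: spatial data involves both geometric and attribute information, and because of this distinctive character it can only be represented on the Web through textual description and geometric graphics separately. At present, spatial information retrieval focuses mainly on matching textual descriptions, and the retrieval of spatial geometric information has been studied relatively little.

Building on an analysis of the problems of spatial information retrieval in the current Web environment, this thesis examines the main research areas involved in solving them and the progress made in these areas in China and abroad. Starting from Web crawling, it discusses the main characteristics and classification system of spatial information in a networked environment, explores methods for parsing and identifying different types of spatial data, and sets out basic methods for measuring data confidence for different data types and their corresponding pages. It also extends the spatial data classification system and proposes the idea of a classification-tag system for crawled spatial data; based on this system, the storage, management and later application of spatial data are realized. Finally, an example model is used to verify parts of the crawling process, with corresponding quality evaluation and analysis.

For different spatial data types, the thesis examines in depth the basic principles and methods of crawling data with a spatially sensitive crawler. It first introduces the concept of the spatially sensitive crawler, describing how it resembles and differs from a traditional crawler, its workflow, and the measurement of similarity between the spatial information of spatially sensitive pages and hyperlinks and the spatial search terms. It then focuses on discovery mechanisms for different types of spatial data, namely spatial data services, raster data, vector data and other data. For each type it discusses how the data appears in Web pages and the basic parsing process, with the necessary explanation of the main algorithms and models involved.

The thesis proposes a confidence measure for Web spatial data. Because Web spatial data lacks descriptive information, its quality is hard to gauge accurately, which makes later retrieval and application relatively difficult. Drawing on basic methods from spatial data quality, and considering both the textual description of the data and the information in the data itself, the thesis proposes a method for qualitatively measuring vector and raster data. It then analyzes and compares the confidence of different spatial data types; for different resources linked to the same spatially sensitive page, the larger confidence is chosen as the best match for the whole page.

Combining a metadata model with current spatial data classification systems, the thesis proposes the idea of classification tags for Web spatial data. Because spatial data on the Web differs in scale of representation, extent, features and so on, it is hard to divide using traditional classification systems, and its descriptive information must be recorded in a new way. With the help of the metadata model and application-related classification systems, a classification-tag system model is proposed. On this basis, the storage and management of Web data after acquisition, and its later retrieval and application, are briefly described.

Through an example model, the whole spatially sensitive crawler is validated, from page filtering through information extraction to basic quality evaluation. The inconsistencies found between theory and practice are analyzed and summarized; they show the complexity of crawling spatial data from the Web and lay a theoretical and practical foundation for follow-up research. Finally, the thesis summarizes each stage of the overall spatial information crawling workflow and proposes several directions for future work.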
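The page-level rule in the abstract above, where the resource with the larger confidence stands in for the whole spatially sensitive page, can be sketched as follows. This is a minimal illustration in Python, not the thesis's implementation: the `Resource` record, its fields, the URLs and the confidence scores are all hypothetical, and the confidence values are assumed to have been computed already.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Resource:
    """A spatial resource linked from a page (hypothetical record)."""
    url: str
    kind: str          # e.g. "service", "raster", "vector"
    confidence: float  # assumed to be precomputed, in [0, 1]


def best_match(resources: Iterable[Resource]) -> Optional[Resource]:
    """Return the resource with the largest confidence, or None if there is none."""
    return max(resources, key=lambda r: r.confidence, default=None)


# Made-up scores for three resources linked from one page.
page_resources = [
    Resource("http://example.com/roads.shp", "vector", 0.62),
    Resource("http://example.com/dem.tif", "raster", 0.81),
    Resource("http://example.com/wms", "service", 0.74),
]
best = best_match(page_resources)
print(best.kind, best.confidence)
```

Returning `None` for a page with no linked resources keeps the "no spatial data found" case distinct from a low-confidence match.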

【Abstract】 The rapid development of Web technology cuts both ways. The Web gives people access to abundant useful information from around the world, but readers must also wade through a huge amount of redundant information, especially in the field of geospatial information. Geospatial data combines attribute information with geometry information, which sets it apart from other kinds of data, and on the Web it can only be represented through textual descriptions and geometric graphics. So far, geospatial information retrieval has focused mainly on describing and matching text, with much less attention paid to geometry information.

This paper first analyzes the problems of geospatial information retrieval and reviews related research worldwide. Building on that analysis, it studies how to parse different kinds of geospatial data from Web pages and discusses basic methods for measuring the degree of confidence in a page and in different spatial data. It also extends the classification of spatial data and proposes a category-and-tag system for Web spatial data, which supports storing and managing large volumes of Web spatial information and its later application. Finally, the paper uses several cases to verify the process of grabbing spatial data and evaluates and analyzes the quality of the resulting data.

Using a geospatially sensitive crawler, the paper discusses the algorithmic strategy and the solution for each step of grabbing geospatial data. An important part of this is the method the crawler uses to parse Web pages. The method is statistical, and different algorithms can be used to compute spatial correlation in order to find the pages that are most sensitive to geospatial information. The paper also studies Web service discovery and the parsing of geospatial information. Service discovery draws on three kinds of service description: OWL-S, WSDL and OGC Capabilities, the last of which is a mature and well-known specification in the field of geospatial information services. The parsing of different types of geospatial data is then discussed, including raster data, vector data and data interchange formats.

Building on this, the paper analyzes fusion methods for geospatial data grabbed from highly sensitive Web pages and illustrates several of them. It also introduces the taxonomies and standards for geospatial information services proposed by different organizations. Geospatial information services cover most kinds of current Web service, so this study can serve as a reference for Web service discovery and composition. Methods for quality evaluation and for fusion are introduced separately for raster and vector data. Because raster data has a relatively simple structure compared with vector data, more fusion methods exist for raster data, while fusion methods for vector data have developed relatively slowly. The visualization of the non-spatial information that accompanies geospatial information is covered as well.

Combining a metadata model with the traditional classification of spatial data, the paper proposes a classification-and-tag system for Web spatial data. Because Web spatial data differs in expression scale, data extent, elements and so on, it is difficult to classify with traditional categories, and a new way of recording its descriptive information is needed.

Case studies validate the whole process of the geospatially sensitive crawler, from page filtering through information extraction to data quality evaluation. The inconsistencies between the theory and the case-study results are analyzed and summarized. They indicate how complex grabbing geospatial data is and lay a foundation for further work.

Finally, the paper summarizes each step of the overall workflow for crawling geospatial information and, on that basis, offers suggestions and directions for further research.
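As a rough illustration of what discovery through OGC Capabilities involves, the sketch below lists the named layers advertised in a WMS 1.3.0 capabilities document, using only Python's standard library. It is an assumption-laden example rather than the parser described in the thesis: the embedded XML is a trimmed, made-up document, and `list_named_layers` is a hypothetical helper. A real crawler would obtain the document from a service's GetCapabilities response instead of a string literal.

```python
import xml.etree.ElementTree as ET

# Trimmed, made-up WMS 1.3.0 capabilities document, for illustration only.
CAPABILITIES_XML = """<WMS_Capabilities version="1.3.0" xmlns="http://www.opengis.net/wms">
  <Service>
    <Name>WMS</Name>
    <Title>Example map service</Title>
  </Service>
  <Capability>
    <Layer>
      <Title>Root layer</Title>
      <Layer>
        <Name>roads</Name>
        <Title>Road network</Title>
      </Layer>
      <Layer>
        <Name>rivers</Name>
        <Title>River network</Title>
      </Layer>
    </Layer>
  </Capability>
</WMS_Capabilities>
"""

WMS_NS = "http://www.opengis.net/wms"
NS = {"wms": WMS_NS}


def list_named_layers(xml_text):
    """Return (name, title) pairs for every Layer element that has a Name child."""
    root = ET.fromstring(xml_text)
    layers = []
    # iter() walks all Layer elements in document order, including nested ones.
    for layer in root.iter(f"{{{WMS_NS}}}Layer"):
        name = layer.find("wms:Name", NS)
        title = layer.find("wms:Title", NS)
        if name is not None:  # layers without a Name only group other layers
            layers.append((name.text, title.text if title is not None else ""))
    return layers


print(list_named_layers(CAPABILITIES_XML))
```

Only layers with a `Name` are reported because, in WMS, a layer without one serves as a grouping container and cannot be requested directly.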

  • 【Online publication contributor】 Wuhan University
  • 【Online publication issue】 2014, No. 07
  • 【Classification number】 P208; TP391.3
  • 【Times cited】 2
  • 【Downloads】 404

