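Research on Ontology-Based Recognition and Extraction of Chinese Administrative Division Place Names (基于本体的中国行政区划地名识别与抽取研究)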

Study on the Ontology-based Extraction of the Names of Chinese Administrative Division

【Author】 Du Ping (杜萍)

【Advisor】 Liu Yong (刘勇)

【Author Information】 Lanzhou University, Cartography and Geographic Information Systems, 2011, PhD

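【Abstract (translated from the Chinese)】 The continuing growth and spread of the Web has driven a rapid increase in the number of web pages, which together hold a wealth of geographic information. Mining that information fully can meet people's needs for geographic query and retrieval, and it can also advance emerging fields such as location-based services. Chinese place names are the most numerous and most common kind of geographic information in Chinese web pages. Building on natural language processing and on a purpose-built spatio-temporal ontology of Chinese administrative division place names, this study identifies such place names in Web text and resolves geo/non-geo and geo/geo ambiguity so that each name corresponds to a specific location on the Earth's surface. It then assigns geographic coordinates and geographic semantics to the names and uses geovisualization to link the Web text to spatial positions on a map.

Most existing work in China on recognizing and extracting Chinese place names takes a natural language processing perspective and stops at preliminary recognition. Because it lacks place-name disambiguation, its results cannot be used in geographic information services. Although some scholars have studied geographic spatio-temporal ontologies and others have studied Chinese place-name recognition and extraction, no clear account yet combines the two with a focus on disambiguation. This dissertation establishes a theoretical framework for ontology-based recognition and extraction of Chinese place names and, on that framework, designs and implements a prototype system for Chinese administrative division place names.

The main results of this study are:

① After introducing and reviewing the concepts of ontology, geographic ontology, and spatial ontology, and starting from the upper-level ontology Basic Formal Ontology (BFO), the study applies mereology, location theory, and basic topology to build a place-name spatio-temporal ontology model with two components, BFO-SNAP and BFO-SPAN. Using that model as the modelling framework, it constructs a spatio-temporal ontology of Chinese administrative division place names that formally expresses place-name changes and the temporal characteristics of place-name evolution.

② Using the General Architecture for Text Engineering (GATE) and ontology-based information extraction, the study designs and implements a prototype system for recognizing and extracting Chinese administrative division place names. The system gives these names, which are indirect geospatial references, precise geographic coordinates, and to some extent removes the semantic barrier between unstructured spatial information in natural language and structured spatial information in GIS.

③ The study analyzes the characteristics and causes of ambiguity in Chinese administrative division place names and divides it into geo/non-geo ambiguity and geo/geo ambiguity. Geo/geo ambiguity is further split into two classes: places in an administrative superior-subordinate relationship that share the same specific name, and places with no such relationship that share the same specific-plus-generic name or the same specific name.

④ The study designs effective ontology-based algorithms for resolving the geo/non-geo and geo/geo ambiguities that are widespread in Web text. The algorithms do not recognize names that have geo/non-geo ambiguity, and they assign a unique geographic location to each recognized name that has geo/geo ambiguity.

⑤ Following the spatio-temporal ontology, the study semantically annotates the unambiguous Chinese administrative division place names in Web text, giving them geographic semantics and geographic coordinates, and implements map visualization of those names.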
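As a rough illustration of what "formally expressing place-name changes over time" in result ① could look like in code, the sketch below stores time-scoped states of an administrative unit (loosely in the spirit of SNAP snapshots) alongside change events (loosely in the spirit of SPAN occurrents). The class names `AdminState` and `ChangeEvent`, the function `parent_at`, and the year-level granularity are assumptions made for this example; the abstract does not describe the dissertation's actual ontology classes or properties.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AdminState:
    """A time-scoped state of an administrative unit (hypothetical, SNAP-like)."""
    name: str
    parent: str
    start: int            # first year in which this state holds
    end: Optional[int]    # first year in which it no longer holds; None = current


@dataclass(frozen=True)
class ChangeEvent:
    """A change that happens to a unit (hypothetical, SPAN-like)."""
    year: int
    name: str
    description: str


# Simplified illustrative data: Chongqing belonged to Sichuan Province
# from 1954 until it became a centrally administered municipality in 1997.
STATES = [
    AdminState("重庆市", "四川省", 1954, 1997),
    AdminState("重庆市", "中华人民共和国", 1997, None),
]

EVENTS = [
    ChangeEvent(1997, "重庆市", "separated from 四川省 as a municipality"),
]


def parent_at(name: str, year: int) -> Optional[str]:
    """Return the superior unit of `name` in `year`, or None if no state is recorded."""
    for state in STATES:
        if state.name != name:
            continue
        if state.start <= year and (state.end is None or year < state.end):
            return state.parent
    return None
```

For example, `parent_at("重庆市", 1990)` yields `"四川省"`, while `parent_at("重庆市", 2005)` yields `"中华人民共和国"`. Keeping states and events in separate records means a query for a past date does not have to replay the change history.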

【Abstract】 The number of web pages has grown rapidly with the development of the World Wide Web, and a huge quantity of geographic information lies in those billions of pages waiting to be mined. Fully exploiting the geographic information on the web not only meets people's geographic query and retrieval needs but also contributes to location-based services (LBS) and other emerging fields. Chinese place names are a major kind of geographic information on the web. In this study, the names of Chinese administrative divisions are extracted from web pages using a series of basic theories and methods: natural language processing, geo-ontology, elimination of geo/non-geo and geo/geo ambiguities, and geovisualization.

At present, much research on the extraction of Chinese place names takes only the viewpoint of natural language processing and stops at preliminary recognition. It lacks disambiguation of ambiguous place names, so the extraction results cannot be used in geographic information services. Although some scholars have studied geographic spatio-temporal ontology or the recognition of Chinese place names, there has been no clear and detailed account that combines the two areas while focusing on place-name disambiguation. This dissertation establishes a theoretical framework for Chinese place-name recognition and extraction based on a place-name spatio-temporal ontology, and a prototype system is designed and implemented on that framework.

The main results of this study are:

① After an introduction to and review of ontology, geo-ontology, spatial ontology, and related concepts, a place-name spatio-temporal ontology model consisting of BFO-SNAP and BFO-SPAN is designed on the basis of Basic Formal Ontology (BFO), using mereology, location theory, and topology. From this model, a spatio-temporal ontology of Chinese administrative divisions is constructed that formally expresses changes of place names and the temporal characteristics of their evolution.

② A prototype system for extracting the names of Chinese administrative divisions is designed and implemented using ontology-based information extraction in the GATE environment. The system turns these names, which are indirect geospatial references, into precise geographic coordinates, removing to a certain extent the semantic barrier between unstructured spatial information in natural language and structured spatial information in GIS.

③ After an analysis of the characteristics and causes of the ambiguities in the names of Chinese administrative divisions, the ambiguities are divided into two types: geo/non-geo and geo/geo. Geo/geo ambiguity is further divided into two categories: places in an administrative relationship that share the same specific name, and places without an administrative relationship that share the same name.

④ Two effective algorithms are designed to eliminate the ambiguities that are widespread in the names of Chinese administrative divisions in web text. Names with geo/non-geo ambiguity are not extracted, while names with geo/geo ambiguity are extracted and assigned unique locations.

⑤ Rich semantics and precise geographic coordinates are given to the extracted unambiguous names according to the spatio-temporal ontology of Chinese administrative divisions, and the names are then plotted on a map for visualization.
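The two kinds of ambiguity in results ③ and ④ can be made concrete with a minimal sketch. It assumes a toy in-memory gazetteer in place of the ontology, a hypothetical list of strings that are also ordinary words, and a simple context heuristic: a candidate is accepted when its superior unit is also mentioned in the text. None of this is taken from the dissertation, whose algorithms the abstract does not spell out. The names `Place`, `GAZETTEER`, `NON_GEO_WORDS`, and `resolve` are all invented for the example, and the coordinates are rounded, approximate values for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Place:
    specific: str   # specific part of the name, e.g. "朝阳"
    generic: str    # generic part, e.g. "区" (district)
    parent: str     # full name of the superior administrative unit
    lat: float      # approximate latitude, illustrative
    lon: float      # approximate longitude, illustrative

    @property
    def full_name(self) -> str:
        return self.specific + self.generic


# Toy gazetteer standing in for the ontology (assumption).
GAZETTEER = [
    Place("朝阳", "区", "北京市", 39.92, 116.44),
    Place("朝阳", "区", "长春市", 43.83, 125.29),
    Place("兰州", "市", "甘肃省", 36.06, 103.83),
]

# Hypothetical list of strings that are also ordinary words
# ("朝阳" also means "morning sun").
NON_GEO_WORDS = {"朝阳"}


def resolve(mention: str, text: str) -> Optional[Place]:
    """Return the single place a mention refers to, or None if it stays unresolved."""
    # geo/non-geo: a bare string that is also an ordinary word is not extracted.
    if mention in NON_GEO_WORDS:
        return None
    candidates = [p for p in GAZETTEER if mention in (p.full_name, p.specific)]
    if len(candidates) == 1:
        return candidates[0]
    # geo/geo: keep the candidate whose superior unit also appears in the text.
    supported = [p for p in candidates if p.parent in text]
    return supported[0] if len(supported) == 1 else None
```

With this sketch, `resolve("朝阳区", "她住在长春市朝阳区。")` selects the district under 长春市, while `resolve("朝阳", "朝阳升起。")` returns `None` because the bare string is treated as an ordinary word. Returning `None` when the context supports no candidate, or more than one, is a deliberately conservative choice for the example.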

  • 【Online Publication Contributor】 Lanzhou University
  • 【Online Publication Year and Issue】 2012, Issue 02