

Research on Intelligent Web Search Engine of Unstructured Spatial Information

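【Abstract (Chinese version)】 Unstructured data makes up most of the information resources on the Web, and it is the main data source and research object of Web search engines. Unstructured spatial data is an important part of these resources. Intelligent search and services for unstructured Web spatial information are the main research topic when a general search engine provides specialized information services in the spatial information domain. This kind of search combines search engine technology with WebGIS and related technologies. It can offer ordinary users local information services (Local Service) and spatial information retrieval tools, in line with the current trend of information retrieval toward intelligent and personalized services.

As a continuation of the "863" project "Intelligent Web Search Engine for Spatial Information", this dissertation builds on Web search engine technology and combines it with natural language processing, GIS and information extraction to carry out an in-depth, systematic study of methods for the intelligent acquisition, processing and serving of unstructured Web spatial information. Following the granularity of text, it studies key technologies at the word, sentence, document and document-hierarchy levels: spatial named entity recognition, spatial semantic analysis, spatial concept extraction, and semantic indexing of the hierarchical structure of anchor text. Using these technologies, the dissertation designs and implements a Map Page Search Engine, the "CiHu" searcher and a prototype of "WenTuZhiTong", and integrates the same technologies and methods into the design and implementation of the Intelligent Web Search Engine of Unstructured Spatial Information (SIISE), producing a first working outline of a complete spatial information search system. Specifically, the following research was carried out:

[1] Online recognition of massive numbers of spatial named entities (SNE). After analyzing general named entity recognition methods, the dissertation proposes exploiting the spatial properties of SNE and using geo-coding to recognize SNE online in single sentences and in full texts. For a single sentence, the text is segmented with a basic gazetteer, and SNE are recognized through geo-code analysis and the merging of SNE units. For a full text, a coarse scan of the whole text collects the relevant geo-codes, code analysis pins down the spatial extent the text refers to, and matching dictionaries are then loaded automatically, according to a defined strategy, to recognize the remaining SNE in the text. Experiments show that this method recognizes a large number of compound SNE that are absent from the dictionary, that the system is adaptive to some degree, and that it largely resolves the inefficiency caused by very large named entity dictionaries.

[2] Spatial semantic analysis and spatial concept extraction in natural language. Spatial semantic roles are defined from the way Chinese expresses spatial concepts and the way GIS represents spatial information. These roles are used to define a formal description of spatial concepts, and a basic approach is proposed for analyzing spatial semantics and spatial concepts in natural language through spatial semantic roles. The method first builds a spatial semantic dictionary and then, following the principles of shallow parsing, extracts the spatial concepts in a text through spatial semantic role labeling, phrase recognition and concept pattern matching. Preliminary experiments show that the method achieves good precision, while recall still needs improvement.

[3] A semantic indexing and retrieval mechanism for the hierarchical structure of anchor text. Based on an in-depth analysis of the characteristics of anchor text and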

【Abstract】 Unstructured data makes up a large part of Web information resources and is the main data source of Web search engines. Unstructured spatial data is an important component of Web resources and the major research subject of the Geo Search Engine (GSE), which is regarded as a branch of the general search engine. GSE combines WebGIS with search engine technology. It can provide Local Service to ordinary users and supply them with geo-related information, in accord with the current trend of information retrieval toward intelligent and personalized services.

As a continuation of the "863" program "Intelligent Web Search Engine for Spatial Information", the dissertation builds on the technologies of Web search engines, Natural Language Processing (NLP), GIS and Information Extraction (IE) to make an in-depth and systematic study of the acquisition, processing and serving of unstructured spatial information. It focuses on the key technologies and approaches of spatial named entity (SNE) recognition, spatial semantic analysis, spatial concept extraction, and semantic indexing and retrieval of the hierarchical structure of anchor text, organized by level of text granularity: word, phrase and sentence. Using these basic research results, the dissertation implements prototype systems such as Map Page Search Engine, SNE Searching and WenTuZhiTong. Finally, an integrated prototype of the Intelligent Web Search Engine of Unstructured Spatial Information (SIISE) is constructed. The main contributions and innovations of this dissertation can be summarized as follows:

[1] The current state of research on Geo Search Engines, spatial concept extraction and semantic indexing is summarized.

[2] Solutions for recognizing Chinese SNE online are given. By means of geo-coding, the dissertation presents an approach to recognizing new Chinese SNE, which do not exist in gazetteers, in online Web pages. Experiments show that the approach is efficient. The algorithm is now applied in the SNE Searching system, which is a client of the CiHu software system.

[3] Definitions of spatial semantic roles are put forward according to Chinese
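
The single-sentence SNE recognition described above (gazetteer-based segmentation, geo-code analysis, merging of SNE units) can be illustrated with a minimal Python sketch. This is not the dissertation's algorithm. The gazetteer entries, the prefix-style hierarchical geo-codes, the forward maximum matching segmenter and the function names `segment`, `merge_units` and `recognize` are all assumptions made for illustration.

```python
# Hypothetical gazetteer: place name -> hierarchical geo-code (illustrative values).
# Assumption: the code of a child region starts with the code of its parent region.
GAZETTEER = {
    "湖北省": "42",
    "武汉市": "4201",
    "武昌区": "420106",
    "北京市": "11",
}
MAX_LEN = max(len(name) for name in GAZETTEER)


def segment(sentence):
    """Forward maximum matching against the gazetteer.

    Returns a list of (token, code) pairs; code is None for non-gazetteer characters.
    """
    tokens, i = [], 0
    while i < len(sentence):
        for size in range(min(MAX_LEN, len(sentence) - i), 0, -1):
            word = sentence[i:i + size]
            if word in GAZETTEER:
                tokens.append((word, GAZETTEER[word]))
                i += size
                break
        else:
            # No gazetteer entry starts here: emit a single character.
            tokens.append((sentence[i], None))
            i += 1
    return tokens


def merge_units(tokens):
    """Merge adjacent gazetteer hits whose geo-codes nest (parent code is a prefix)."""
    entities, current = [], None
    for word, code in tokens:
        if code is None:
            # A non-SNE token closes the unit being built.
            if current is not None:
                entities.append(current)
                current = None
        elif current is not None and code.startswith(current[1]):
            # Spatially consistent neighbour: extend the compound SNE.
            current = (current[0] + word, code)
        else:
            if current is not None:
                entities.append(current)
            current = (word, code)
    if current is not None:
        entities.append(current)
    return entities


def recognize(sentence):
    """Return compound SNE found in one sentence as (text, geo-code) pairs."""
    return merge_units(segment(sentence))
```

For example, `recognize("我在湖北省武汉市武昌区工作")` merges three gazetteer hits into the single compound entity `湖北省武汉市武昌区`, which is not itself a gazetteer entry. Adjacent names whose codes do not nest, such as `北京市武汉市`, stay separate.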
