命名实体识别在方志内容挖掘中的应用研究

Research on the Application of Named Entity Recognition in Content Mining of Chinese Local Chronicles

【Author】 Zhu Suoling (朱锁玲)

【Supervisor】 Bao Ping (包平)

【Author information】 Nanjing Agricultural University, History of Science and Technology, 2011, doctoral dissertation

【Subtitle】 A Case Study of the Local Chronicle: Produce (《方志物产》) of Guangdong, Fujian and Taiwan

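【Abstract (translated from Chinese)】 Chinese local chronicles (方志) originated early, have been compiled continuously over a long period, cover a full range of types and survive in large numbers. According to the Union Catalog of Chinese Local Chronicles (《中国地方志联合目录》), the surviving chronicles from the Song Dynasty through the Republican period alone number 8,264 titles in more than 110,000 volumes, about one-tenth of all Chinese ancient books. Collating and using chronicle material is a long-standing tradition in China. The Local Chronicle: Produce (《方志物产》) is a thematic compilation made in the 1950s, when Wan Guoding, a noted agricultural historian and one of the principal founders of Chinese agricultural history as a discipline, organized dozens of people to excerpt local chronicles by hand over six years. It records in detail the names, properties, uses and distribution of local products, and is of great value as a source for the history of agricultural science and technology and for economic history. With information technology advancing steadily, how to use it to collate chronicle material and make that material easier to develop and exploit has become a very practical question.

Taking the Local Chronicle: Produce as its basis, this dissertation explores a new method for collating ancient books of the local-chronicle type. First, it discusses the collation of local chronicles in terms of its main content, basic techniques and existing results; describes in detail the origin of the Local Chronicle: Produce and the course of its manual and digital collation; and analyzes the problems in current chronicle collation, from which the purpose and significance of the study follow. Second, it sets out the basic linguistic background of named entity recognition (its concept and role, its tasks, and the characteristics and difficulties of Chinese named entity recognition), concentrates on recognition methods, and summarizes related research in China and abroad. It then draws on the characteristics of local chronicles and of the location names in the Local Chronicle: Produce to formulate a method for recognizing those location names. Using the Local Chronicle: Produce of Guangdong, Fujian and Taiwan as an example, it builds a location-name recognition system and mines the content of the Local Chronicle: Produce through statistical analysis of the recognition results. The main research content is as follows.

(1) Design and construction of the location-name recognition system for the Local Chronicle: Produce. The system comprises two functional modules: a full-text database and a location-name recognition subsystem. To build the full-text database, the study starts from the format in which the three provinces' texts describe products, draws on the uniform format analyzed and extracted by earlier researchers, normalizes the text format, and designs the database structure accordingly. The full-text database supports full-text retrieval, keyword retrieval, cluster retrieval and data statistics. The recognition subsystem combines rule-based and statistical named entity recognition with the particular characteristics of local chronicles to recognize the location names of products automatically. It has four functions: rule management, location-name recognition, correction of the location-name database, and information statistics. Testing shows that the system can meet the needs of researchers for retrieval and knowledge discovery in local chronicles, and that recognition performance can be improved step by step by continually refining the rules.

(2) A study of the products in the Local Chronicle: Produce. All products recorded for Guangdong, Fujian and Taiwan are counted and analyzed by historical period, by type of chronicle and by region. By period, the average number of products recorded per chronicle increases from the Ming Dynasty to the Qing Dynasty and again to the Republican period. By type, the average number of products per chronicle decreases from provincial chronicles (通志) to prefectural chronicles (府志) to county chronicles (县志). By region, the three provinces' texts record not only the products of those provinces but also those of the whole of Hainan Province and part of Guangxi.

(3) Content mining based on the location names of products, covering a statistical analysis of all correct location names, the distribution of products in each province, the spread of products, and the introduction of non-native products. The statistical analysis of all correct location names is based on 7,179 valid recognition records. For each province, the recognized names are counted by category: names within the province, names outside the province, foreign names and broad (non-specific) names. The analysis shows that Taiwan's exchange and spread of products with the outside world was relatively wider than that of the other two provinces. The study of product distribution analyzes in detail the distribution of products in the three provinces on the basis of the statistics and uses ArcGIS to draw thematic distribution maps that present the results comprehensively and intuitively. It finds that two main factors determine the diversity of a region's products: natural factors, including geographical location, natural environment and climate; and human factors, including the development and use of natural resources and the introduction and spread of non-native products. The study of product spread analyzes how the three provinces' products spread, again with ArcGIS thematic maps. It finds that the breadth of exchange and spread of products between regions decreases as the distance between them grows: the greater the distance, the less exchange and spread. The study of the introduction of non-native products analyzes and compares such introduction in the three provinces. It finds two causes that promoted the introduction and spread of products: trade between regions, and colonial aggression and war.

(4) Content mining based on the recognition rules, covering a statistical analysis of all recognition rules, a comparative study of product distribution, and a study of the routes by which products were introduced and spread. The statistical analysis of the rules is likewise based on the 7,179 valid recognition records. According to their meaning, the rules are divided into two classes, rules that recognize the location names of product distribution and rules that recognize the location names of product introduction and spread, and each class is counted separately. The comparative study of product distribution uses the rule statistics to extract what the chronicles say about where products originated, where they were distributed, and which places produced better or more of them, and from this identifies the places of origin, the places of best quality and the places of highest yield for some products. The study of routes uses the classified rule statistics to summarize the main routes by which non-native products were introduced and spread in the Ming and Qing periods: foreign trade, tribute, and transmission by court envoys or monks.

In sum, this dissertation takes the agricultural-history source Local Chronicle: Produce as its corpus, applies the theories and methods of information organization by means of named entity recognition to recognize its location names, and mines its content through bibliometric analysis of the recognition results, with the aim of exploring a new, content-based method of collating ancient books. The main work and contributions are: (1) applying the theories and methods of named entity recognition to local chronicles in order to recognize and mine the location names in them; (2) using bibliometric methods to analyze the product names, product location names and recognition rules in the recognition results, thereby obtaining knowledge about product distribution, introduction and spread and achieving content-based digital collation of ancient books; (3) using GIS thematic maps to display that knowledge intuitively, going beyond the traditional mode of purely textual description so that the spatial and temporal character of local chronicles as a historical and cultural resource is fully revealed.

Named entities include person names, location names, organization names and so on. This dissertation concentrates on recognizing location names in the Local Chronicle: Produce of Guangdong, Fujian and Taiwan; other entities, such as chronicle titles, dates of compilation and product names, were roughly separated with machine assistance during document processing. In future, by modifying or re-entering and reorganizing the rules, the approach could be extended to chronicle material from other provinces or to other types of ancient books, and to entities other than location names, such as person names, official titles and institution names, so that ancient texts can be mined and used from multiple perspectives and can supply historical evidence for modern industrial and agricultural production and for scientific research.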

【Abstract】 Chinese local chronicles originated early, have been compiled over a long period, cover many types and survive in large numbers. According to the Union Catalog of Chinese Local Chronicles, the chronicles compiled between the Song Dynasty and the Republic of China that are still preserved amount, on their own, to 8,264 titles in more than 110,000 volumes, about one-tenth of all Chinese ancient books. Collating and using local chronicles is a long-standing Chinese tradition. In the 1950s, Wan Guoding, a famous agricultural historian and one of the principal founders of Chinese agricultural history as a discipline, led dozens of people in excerpting local chronicles to compile the thematic material named Local Chronicle: Produce. This material is of great value for the history of agricultural science and technology and for economic history, because it records in detail the names, properties, uses and distribution of products. Now that information technology is developing rapidly, how to use it to collate local chronicle material and reduce the difficulty of exploiting that material has become a practical question. Based on Local Chronicle: Produce, this paper explores a new method of collating ancient books such as local chronicles.

First, the paper discusses the main content of local chronicle collation, its basic methods and the existing results. It then describes the origin of Local Chronicle: Produce and the process of collating it both by hand and digitally. The problems of current chronicle collation are analyzed, and the purpose and significance of the present research are set out. Next, the paper introduces the basic linguistic background of named entity recognition: its concept and role, its tasks, and the characteristics and difficulties of Chinese named entity recognition. It summarizes related research in China and abroad and discusses recognition methods. Finally, it formulates a method for recognizing location names in Local Chronicle: Produce according to the characteristics of Chinese local chronicles and of the location names in that material. Using the Local Chronicle: Produce of Guangdong, Fujian and Taiwan, the paper builds a location-name recognition system and explores a method of content mining for Chinese local chronicles. Statistics on the recognition results are then used to study products, location names and rules. The main contents are as follows.

(1) The location-name recognition system for Local Chronicle: Produce consists of two functional modules, a full-text database and a location-name recognition subsystem. For the full-text database, the paper starts from the format in which the Local Chronicle: Produce of Guangdong, Fujian and Taiwan describes products, draws on earlier analyses to define a standard text format, and designs the database structure on that basis. The full-text database supports full-text retrieval, keyword retrieval, cluster retrieval and data statistics. The recognition subsystem combines rule-based and statistics-based methods with the particular features of local chronicles to recognize the location names of products automatically. It provides rule management, location-name recognition, correction of the location-name database and information statistics. Tests show that the system can meet the needs of researchers for retrieval and knowledge discovery in ancient books, and that recognition performance can be improved gradually by refining the rules.

(2) Analysis of the products in Local Chronicle: Produce. All products recorded for Guangdong, Fujian and Taiwan are counted and analyzed by historical period, by type of chronicle and by region. By period, the average number of products recorded per chronicle increases from the Ming Dynasty to the Qing Dynasty and then to the Republic of China. By type, the average number of products per chronicle decreases from provincial chronicles to prefectural chronicles and then to county chronicles. By region, the Local Chronicle: Produce of Guangdong, Fujian and Taiwan covers not only the products of these three provinces but also those of all of Hainan Province and part of Guangxi.

(3) Content mining of Local Chronicle: Produce based on location names. This includes a statistical analysis of all correct location names, the distribution of products in each province, the spread of products, and the introduction of products from other places. The statistical analysis of all correct location names is based on 7,179 valid location-name recognition records. For each province, the records are classified and counted as names within the province, names outside the province, foreign names and names covering broad areas. The analysis shows that, compared with the other two provinces, Taiwan's exchange of products with the outside world was relatively wider. The study of product distribution analyzes the specific distribution of products in Guangdong, Fujian and Taiwan from the relevant statistics and uses ArcGIS to draw thematic maps, so that the content is shown comprehensively and intuitively. It finds two main factors that determine the diversity of local products: the natural factors of a region, including its geographical location, natural environment and climate, and the human factors, including the development and use of natural resources and the introduction and spread of products from other places. The study of product dissemination analyzes in detail the spread of products in the three provinces, again with ArcGIS thematic maps. It finds that the range of inter-regional exchange and dissemination of products decreases as the distance between regions grows: the greater the distance, the less exchange and dissemination. The study of product introduction compares the introduction of products from other places into Guangdong, Fujian and Taiwan. It finds two causes that promoted the introduction and spread of products: trade between regions, and colonial aggression and war.

(4) Content mining of Local Chronicle: Produce based on recognition rules. This includes a statistical analysis of all recognition rules, a comparative study of product distribution, and a study of the routes by which products were introduced and spread. The statistics are again based on the 7,179 valid recognition records. According to their meaning, the rules are classified into two types: rules that identify the location names of product distribution, and rules that identify the location names of product introduction and spread. Using the statistics on the rules, the paper extracts what the chronicles record about the products' places of origin, their places of distribution, and which places produced better or more of them, and summarizes the places of origin, the best-quality places and the high-yield places of some products. Using the classified statistics, the paper also summarizes three main routes by which products were introduced and spread in the Ming and Qing Dynasties: foreign trade, tribute, and transmission by court envoys or monks.

In short, this paper takes Local Chronicle: Produce as its corpus and uses named entity recognition to recognize its location names. Through bibliometric analysis of the recognition results, it studies content mining of Local Chronicle: Produce in order to explore a new, content-based method of collating ancient books. The contributions of this paper are: (1) It applies the theories and methods of named entity recognition to ancient books such as Chinese local chronicles in order to recognize the location names in them. (2) It uses bibliometric methods to analyze the product names, location names and recognition rules in the recognition results, obtaining knowledge about the distribution, spread and introduction of products and achieving content-based digital collation of ancient books. (3) It uses GIS thematic maps to show the distribution, introduction and spread of products in Local Chronicle: Produce more intuitively, going beyond the traditional mode of written description so that the spatial and temporal features of the chronicles are fully revealed.

Named entities include person names, location names, organization names and so on. This paper recognizes only the location names in the Local Chronicle: Produce of Guangdong, Fujian and Taiwan. In the future, by modifying, re-entering or reorganizing the rules, researchers could recognize other entities such as person names and organization names, so that ancient material can be mined and used from multiple perspectives and can provide historical reference evidence for industrial and agricultural production and for scientific research.
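
As a rough illustration of the rule-based side of the recognition described in the abstract, the following Python sketch matches a trigger word followed by a name from a location-name list and tags each hit with one of the two rule types (distribution versus introduction). It is only a sketch under stated assumptions: the trigger words, the location-name list, the function names and the sample sentence are all invented for illustration and are not taken from the dissertation's rule base or from Local Chronicle: Produce, and the statistical component of the method is not modeled.

```python
import re
from collections import Counter

# Hypothetical location-name list (illustrative entries only).
GAZETTEER = ["吕宋", "暹罗", "琼州", "漳州", "泉州"]

# Hypothetical trigger words for the two rule types named in the abstract.
RULES = {
    "distribution": ["出", "产于"],          # "produced in <place>"
    "introduction": ["来自", "种自", "传自"],  # "came from <place>"
}


def build_patterns(rules, gazetteer):
    """Compile one regex per rule type: a trigger word followed by a known place."""
    # Longer names first, so a longer name wins over any shorter name it contains.
    places = "|".join(sorted(map(re.escape, gazetteer), key=len, reverse=True))
    patterns = {}
    for kind, triggers in rules.items():
        trigger_alt = "|".join(map(re.escape, triggers))
        patterns[kind] = re.compile("(?:" + trigger_alt + ")(" + places + ")")
    return patterns


def recognize(text, patterns):
    """Return (rule_type, place) pairs for every rule match in the text."""
    hits = []
    for kind, pattern in patterns.items():
        for match in pattern.finditer(text):
            hits.append((kind, match.group(1)))
    return hits


if __name__ == "__main__":
    patterns = build_patterns(RULES, GAZETTEER)
    # Invented sample sentence, not a quotation from any chronicle.
    sample = "番薯种自吕宋,今漳州多种之。荔枝出泉州者佳。"
    hits = recognize(sample, patterns)
    print(hits)
    # Simple per-place counts, standing in for the statistics step.
    print(Counter(place for _, place in hits))
```

In this sketch a place name is counted only when a trigger word precedes it, so the mention of 漳州 without a trigger in the sample is ignored. A name that appears in no list, or that follows an unlisted trigger, is also missed, which is one way to see why the abstract says results improve as the rules are refined.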
