Research on Knowledge Organization and Content Mining of Chinese Local Chronicles (地方志知识组织及内容挖掘研究)
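【Author】 Heng Zhongqing (衡中青);
【Advisor】 Hou Hanqing (侯汉清);
【Author information】 Nanjing Agricultural University, History of Science and Technology, 2007, PhD
【Subtitle】 Taking Local Chronicle of Guangdong: Produce (《方志物产·广东》) as an Example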
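【Abstract (translated from the Chinese)】 In the 1950s, under the direction of the noted agricultural historian Wan Guoding, the Chinese Agricultural Heritage Research Office (中国农业遗产研究室) spent more than six years excerpting and collating the "produce" (物产) material from more than 6,000 local chronicles held across China. The result was Local Chronicle: Produce (《方志物产》), a compilation of 431 volumes and about 30 million characters. It preserves, largely intact, the historical record of produce throughout China during the Ming, Qing and Republican periods, and is of great value as a source for the history of agricultural science and technology and for economic history. This thesis takes that material as its basis and explores approaches and methods for organizing the information in local chronicles.
The thesis first approaches the subject from local-chronicle bibliography: it discusses the types of chronicle catalogues and the ways in which chronicles are described, and summarizes the achievements and characteristics of indexing work on old and new Chinese local chronicles. Second, from the standpoint of collating historical sources on agricultural produce, it discusses and summarizes where those sources come from and what their collation has yielded. The focus of the thesis is Local Chronicle of Guangdong: Produce (《方志物产·广东》, the Guangdong portion of Local Chronicle: Produce). It first builds an information system for the compilation in order to explore methods for knowledge organization and content mining of local chronicles, and then uses the basic data obtained from that system to study the produce and the cited books. The main research content is as follows.
(1) Design and construction of the Local Chronicle of Guangdong: Produce information system. The system mainly comprises a full-text database, a produce index subsystem, and a cited-book mining and indexing subsystem. For the full-text database, the narrative formats of the chronicles were analysed, and a standardized narrative format for produce entries, general enough to cover all the source chronicles, was extracted as the basis for field design. Besides basic full-text retrieval, the database supports keyword retrieval, cluster retrieval and data statistics. The produce index subsystem uses pattern recognition to identify the alternate names of produce and builds an alternate-name indexing dictionary; together with the formal-name indexing dictionary, this forms the produce indexing dictionary used for computer indexing of produce and for index generation. The subsystem has four functions: pattern maintenance, alternate-name recognition, entry database maintenance, and index generation and browsing. The cited-book mining and indexing subsystem uses citation patterns, book-title feature patterns and personal-name citation patterns to mine cited books and build a cited-book indexing dictionary, which is used for computer indexing of cited books and for index generation. It also has four functions: maintenance of the cited-book pattern library, cited-book pattern recognition, entry database maintenance, and index generation and browsing.
(2) Study of the produce in Local Chronicle of Guangdong: Produce, covering the statistics and analysis of produce distribution, produce classification, and alternate names of produce. For distribution, all produce data in the compilation were counted and analysed by historical period and by region. By period, the results show that Ming chronicles record the most produce per chronicle, Republican chronicles the next most, and Qing chronicles the fewest; the average length per chronicle is greatest in the Republican period, followed by the Qing, and smallest in the Ming. From the Ming through the Qing to the Republican period, chronicles describe produce in ever greater detail. By region, the results show that the average number of produce items per chronicle decreases from provincial to prefectural to county chronicles as the area covered shrinks, and that it decreases from west to east across Western Guangdong, the Pearl River Delta, Northern Guangdong and Eastern Guangdong. For classification, the thesis analyses and summarizes the section headings and category headings of all the source chronicles, discusses the classification features, the strengths and weaknesses of the category settings, and the classification criteria for plants, animals and goods, and on that basis draws up a classification scheme able to classify all the produce. The scheme has three top-level classes (plants, animals and goods), with 13 second-level classes under plants, 14 under animals and 9 under goods. For alternate names, the expression patterns of the 1,418 alternate names gathered from the compilation are grouped into five types: names introduced by alias-marker words, taboo-avoidance names, regional names, literature-specific names and special-trade names; the origins of these names are also discussed. The expression patterns of alternate names are the basis for mining them.
(3) Study of the cited books, covering a statistical analysis of all citation data and a study of citing practices. The statistical analysis addresses the 31,670 citations of all kinds of literature in Local Chronicle of Guangdong: Produce, from the perspective of the source chronicles and from the perspective of the cited books, using citation frequency as the measure. From the perspective of the source chronicles: by period, the average number of citations per chronicle increases in chronological order across the Ming, Qing and Republican periods, with the Republican figure far higher than the other two; by geographic scope, the highest average belongs to the provincial chronicles, which record the produce of the whole province; by location, the Pearl River Delta average is higher than those of Western, Eastern and Northern Guangdong. From the perspective of the cited books: poems, songs, ballads and proverbs are cited 2,141 times and come from three sources, namely works by native Lingnan writers, works by writers from outside Lingnan who served there as officials, and the folk songs and proverbs current in Lingnan at the time; standalone works are cited 29,529 times, and are characterized by extensive citation of Lingnan local literature (chiefly Lingnan local chronicles), extensive citation of the Caifangce (《采访册》, field interview records), which reflect the actual state of produce at the time, and extensive citation of traditional Chinese medical literature. On citing practices, the thesis collects all the book-name citation patterns and citation expression patterns in the compilation. There are three book-name citation patterns (citing the title of the work, citing the author's name, and citing the author's name plus the title) and three citation expression patterns (the front-sign type, the back-sign type and the enclosed type). These patterns are the basis and the means for mining cited books. In addition, the thesis analyses the cited books in the produce section of Lingnan Congshu (《岭南丛述》). This analysis covers the standalone works cited there, excluding poems, songs, ballads and proverbs, and counts them by number of distinct works; it examines historical period, citation frequency, region and discipline in order to explore the information sources and the structure of the material in that work.
In sum, the thesis applies the methods of agricultural-history source studies and information science, together with computer technology, in an attempt to collate the produce material in local chronicles on the basis of its knowledge content, with the aim of exploring approaches to knowledge organization of local chronicles and to the collation of historical sources on agricultural produce. Its innovations are as follows. 1. It applies pattern recognition theory and methods to local chronicles, a type of ancient document, to identify and mine the alternate names of produce and the books cited in chronicle literature. 2. It analyses and extracts the narrative formats of the content of Local Chronicle: Produce to form a unified, standardized database format for chronicle produce literature, in the hope of exploring a content-analysis-based method for collating ancient books. 3. It uses bibliometric methods to analyse the cited books in Local Chronicle: Produce, seeking to uncover the content structure of ancient agricultural books and to provide a quantitative method for "distinguishing schools of learning and tracing their origins" (辨章学术,考镜源流) in such books. 4. Tailored to the characteristics of chronicle literature, it builds for the first time an information system for Local Chronicle of Guangdong: Produce, used for full-text retrieval of the produce literature, for generating produce and cited-book indexes, and for mining alternate names of produce and cited books.
The thesis also has shortcomings that call for further research. 1. The extraction of narrative formats for produce entries relied on manual analysis, and the normalization of those formats was not fully automated. Developing software that extracts and processes narrative formats automatically, designed for the characteristics of chronicle literature, is therefore the first problem to solve before chronicle sources can be processed on a large scale. 2. After pattern recognition, cited books and alternate names still require manual judgement, so the process is not fully automated. The next step is to improve the recognition function, reduce manual intervention and increase the degree of automation. 3. The corpus used here is limited to the produce sections of Guangdong chronicles; the material for other provinces in Local Chronicle: Produce is not covered, and a comprehensive, systematic analysis of produce and cited books remains to be done. There are many methods and forms of knowledge organization for local chronicles; this thesis selects only some of the more practical ones, namely the full-text database, the produce index, the cited-book index, produce analysis and cited-book analysis. Local chronicles are a "rich mine"; this thesis extracts from them only the alternate names of produce and the cited books, and does not touch on other aspects of produce or on the study of lost books. Chronicles also contain a great deal of other historical material that awaits exploration. Knowledge mining of local chronicles is therefore the direction and focus of our future work.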
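The pattern-based recognition of alternate names described in this abstract can be illustrated with a minimal sketch. The abstract does not list the actual patterns, so everything concrete here is an assumption: the marker words in `ALIAS_MARKERS`, the limit of six characters per alias, the presence of modern punctuation, and the function names `find_aliases` and `build_alias_dictionary` are all hypothetical, and the sample entry is invented for illustration. As the abstract notes, candidates found this way would still need manual checking.

```python
import re

# Hypothetical alias-marker words; the thesis's real pattern library is not
# given in the abstract.
ALIAS_MARKERS = ["一名", "又名", "俗名", "俗呼", "亦名"]

# Assumption: an alias is 1-6 CJK characters that follow a marker and end at
# punctuation or at the end of the text.
ALIAS_RE = re.compile(
    "(?:" + "|".join(ALIAS_MARKERS) + ")"
    r"([\u4e00-\u9fff]{1,6}?)(?=[,。;、,;.]|$)"
)


def find_aliases(headword, description):
    """Return (formal name, alias) pairs found in one produce entry."""
    return [(headword, alias) for alias in ALIAS_RE.findall(description)]


def build_alias_dictionary(entries):
    """Map each alias to the formal name under which it first appears."""
    alias_to_formal = {}
    for headword, description in entries:
        for formal, alias in find_aliases(headword, description):
            alias_to_formal.setdefault(alias, formal)
    return alias_to_formal


if __name__ == "__main__":
    # Invented sample entry, not a quotation from any chronicle.
    sample = [("荔枝", "一名丹荔,又名离枝。")]
    print(build_alias_dictionary(sample))
```

A dictionary of this kind plays the role of the alternate-name indexing dictionary: during indexing, each alias can be looked up and filed under its formal name.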
【Abstract】 In the 1950s, under the direction of Wan Guoding, material on agricultural produce was extracted from more than 6,000 Chinese local chronicles held in libraries across China. The material was compiled into a series titled Local Chronicle: Produce, comprising 431 volumes and about 30,000,000 characters. It covers every aspect of agricultural production; its main content concerns animal and plant variety resources and the techniques for raising and cultivating them. The series is highly systematic and preserves agricultural production material from the Ming Dynasty, the Qing Dynasty and the Republic of China. Because of its great value for the history of agricultural science and technology and for economic history, it has attracted the attention of scholars in China and abroad. However, the series is a rare handwritten copy, extremely brittle, easily damaged and inconvenient to use, so it is important and urgent to employ modern information technology to protect, disseminate and make use of it.
Taking Local Chronicle of Guangdong: Produce as an example, this thesis explores digital methods for Local Chronicle: Produce. We construct an information organization system for the produce material, whose functions include full-text retrieval and the indexing of produce names and cited book titles. We also carry out statistical analyses of the produce and of the cited books in the series, with emphasis on the alternate names of produce, the classification of produce and the ways in which books are cited. Finally, based on the cited-book data of Lingnan Congshu, bibliometric analyses are carried out by historical period of the cited books, highly cited books, the authors' regions of origin, and the disciplines of the cited books.
Firstly, the produce analysis includes a statistical analysis of the produce, a study of produce classification, and a study of alternate names.
(1) The statistical analysis of the produce covers all the produce data in Local Chronicle of Guangdong: Produce, analysed by historical period and by region. Historical period: we calculated and analysed the average number of produce items and the average length per chronicle for four periods, namely the Yuan Dynasty, the Ming Dynasty, the Qing Dynasty and the Republic of China. Only one chronicle dates from the Yuan Dynasty, so that period has no statistical significance. The average number of produce items per chronicle is highest for the Ming Dynasty, next for the Republic of China and lowest for the Qing Dynasty. The average length is greatest for the Republic of China, next for the Qing Dynasty and smallest for the Ming Dynasty. In general, from the Ming Dynasty to the Qing Dynasty to the Republic of China, the descriptions of produce become more and more detailed. The reason is that Chinese science and technology were developing in modern times and that Western science, technology and culture were spreading into China, which influenced the compilation of Chinese local chronicles. Region: the chronicles are first divided into three types, namely provincial, prefectural and county chronicles. The statistics show that the average number of produce items per chronicle decreases gradually from the provincial to the prefectural to the county level, consistent with the expectation that a larger territory has more abundant resources and a smaller one fewer. The chronicles are then grouped into four larger regions: Western Guangdong, the Pearl River Delta, Northern Guangdong and Eastern Guangdong. The statistics show that the average number of produce items per chronicle decreases gradually in that order.
(2) Produce classification: each produce item is classified into one of four categories: plant, animal, mineral and goods. The category headings set up in the chronicles have several characteristics: a single heading may express content of several aspects, the classification criterion is not uniform, and headings sit at different levels. The bases for classifying plants are attribute, economic use, appearance, living conditions, whether domesticated, and the modern biological classification system. The bases for classifying animals are attribute, appearance, living conditions, whether domesticated, and the modern biological classification system. The bases for classifying goods are attribute, material, method of manufacture, raw material and mode of transport.
(3) Alternate names: many produce items have multiple names, whose forms of expression and origins vary. They include names introduced by alternate-name words, taboo-avoidance names, regional names, literature-specific names and special-trade names.
Secondly, the analysis of the cited books includes a statistical analysis of all the citation data, an analysis of the ways of citing, and a bibliometric analysis of the books cited in Lingnan Congshu.
(1) The statistical analysis of all the citation data is carried out from the perspective of the source chronicles and from the perspective of the cited books. The analysis of the source chronicles covers historical period and region. Historical period: the citation counts of the source chronicles were analysed for four periods, namely the Yuan Dynasty, the Ming Dynasty, the Qing Dynasty and the Republic of China. Only one chronicle dates from the Yuan Dynasty, so that period has no statistical significance. The mean number of citations per chronicle increases in chronological order across the Ming Dynasty, the Qing Dynasty and the Republic of China, and the mean for the Republic of China is far higher than the other two. This further indicates that the development of Chinese science and technology in modern times, and the spread of Western science, technology and culture into China, deeply influenced the compilation of Chinese local chronicles. Region: the source chronicles were divided into four regions (Western Guangdong, the Pearl River Delta, Northern Guangdong and Eastern Guangdong), alongside the provincial chronicles, and their citation counts were analysed. The provincial chronicles have the highest mean number of citations. The wider the scope of a chronicle, the more produce it records and the more literature its compilers cite; officials hired the most outstanding scholars, whose writing was rigorous and whose style was excellent, to compile these chronicles; and some authors of private works cited broadly from encyclopedic sources. Chronicles with high citation counts resulted. Among the local regions, the chronicles of the Pearl River Delta cite the most books, followed in order by Western Guangdong, Eastern Guangdong and Northern Guangdong. Cited books: the cited literature is divided into two kinds. The first is poetry, ballads and proverbs, which are scattered and cannot be assigned to a single monograph; the second is papers and monographs. Poetry, ballads and proverbs are cited 2,141 times. They are historical material in literary form and show how people of the time used produce to express their feelings. This material has three origins: works by native Lingnan writers, works by writers who served as officials in Lingnan, and folk literature. Papers and monographs are cited 29,529 times. Their composition is characterized by heavy citation of the local literature of Lingnan, heavy citation of the Interview Book, which recorded the actual state of produce, and heavy citation of Chinese medical literature, which demonstrates the significant medicinal value of Lingnan produce.
(2) Ways of citing: the thesis extracts all the title-citing patterns and citing expression patterns from Local Chronicle of Guangdong: Produce. The title-citing patterns comprise the literature title, the author name, and the author name plus the literature title. The citing expression patterns comprise the front-sign type, the back-sign type and the enclosed type. All the patterns are used to recognize the cited books.
(3) The bibliometric analysis addresses the papers and monographs cited in Lingnan Congshu, which cites 351 different works a total of 2,296 times. The analysis by period shows that, ranked by number of distinct cited works from high to low, the periods are: the Song and Yuan Dynasties; the Qing Dynasty; the Three Kingdoms, Jin, and Northern and Southern Dynasties; the Ming Dynasty; the Sui, Tang and Five Dynasties; the Qin and Han Dynasties; and the pre-Qin era. That the Song and Yuan period accounts for the most cited works suggests that science and technology were at their most prosperous in China's feudal society during that time. The analysis by frequency shows that the most-cited work, cited 207 times, is Guangdong Xinyu by Qu Dajun, which was the most valuable reference for Deng Chun, the author of Lingnan Congshu. The analysis by region shows that authors from the lower Yangtze River account for the largest number of cited works, that authors from the Lingnan area have the highest citation frequency, that in other areas such as the Yellow River valley, the Two Lakes area and the southwest both the number of cited works and the citation frequency are lower, and that no cited work comes from the northeast. The analysis by discipline shows that miscellanies account for both the largest number of cited works and the highest citation frequency, indicating that miscellanies were the main information source for Lingnan Congshu. Other kinds are cited less: notes and commentaries, traditional Chinese medicine books, produce and natural science books, history and geography books, collections of literary works, notes on poets and poetry, ancient agricultural books, and officially sponsored local chronicles. Together, these analyses outline the origins of the material in Lingnan Congshu and the structure of its content.
Thirdly, the produce information organization system includes the full-text database, the produce index and the cited-book index.
(1) To construct the full-text database, the writing styles of the chronicles are analysed in order to derive a standard narrative form for produce entries in local chronicles, on which the design of the database fields is based. The main functions of the database system are full-text retrieval, keyword retrieval, cluster retrieval and data statistics.
(2) The produce index subsystem uses pattern recognition to recognize the alternate names of produce and to construct an alternate-name index dictionary, which is used together with the formal-name dictionary to index the produce. The functions of the subsystem are pattern maintenance, alternate-name recognition, entry database maintenance, and index building and browsing.
(3) The cited-book index subsystem mines the cited books using linguistic patterns of citation, linguistic patterns of book titles and linguistic patterns of author-name citation, and constructs a cited-book title dictionary with which to index the cited books. The functions of the subsystem are pattern database maintenance, pattern recognition, entry database maintenance, and index building and browsing.
Chinese local chronicles can be organized in many ways, but this thesis uses only a few of them: the full-text database, the produce index, the cited-book index, the statistical analysis of the produce and the statistical analysis of the cited books. We hope that the thesis helps to identify methods and directions for organizing the information in Chinese local chronicles.
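The citing expression patterns mentioned in this abstract lend themselves to a short illustration. The sketch below is only an assumption about what the front-sign and back-sign types might look like, since the abstract names the types without defining them: here a front-sign citation is taken to be a title followed by a verb of saying, and a back-sign citation a marker such as 见 followed by a title. The enclosed type is left out because its form cannot be inferred from the abstract. The marker characters, the reliance on 《》 title marks (which handwritten chronicles may lack), and the names `classify_citations` and `citation_frequency` are hypothetical, and the sample sentences are invented.

```python
import re
from collections import Counter

# Assumed front-sign form: 《title》 followed by a verb of saying.
FRONT_SIGN = re.compile(r"《([^》]+)》(?:云|曰|载|谓)")

# Assumed back-sign form: a marker such as 见 followed by 《title》.
BACK_SIGN = re.compile(r"(?:见|出|详)《([^》]+)》")


def classify_citations(text):
    """Return (title, pattern type) pairs for the citations found in text."""
    hits = [(m.group(1), "front-sign") for m in FRONT_SIGN.finditer(text)]
    hits += [(m.group(1), "back-sign") for m in BACK_SIGN.finditer(text)]
    return hits


def citation_frequency(texts):
    """Count how often each title is cited across a collection of entries."""
    counts = Counter()
    for text in texts:
        counts.update(title for title, _ in classify_citations(text))
    return counts


if __name__ == "__main__":
    # Invented sample sentences, not quotations from any chronicle.
    entries = [
        "《广东新语》云:荔枝以增城为上。",
        "龙眼,见《岭南杂记》。",
        "《广东新语》曰:榄有乌白二种。",
    ]
    for entry in entries:
        print(classify_citations(entry))
    print(citation_frequency(entries).most_common())
```

Counting titles in this way is the kind of tally on which a frequency ranking of cited books rests, though the figures in the abstract come from the author's own system and manual verification, not from this sketch.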
【Key words】 Local chronicle; Local Chronicle: Produce; Knowledge organization; Content mining; Local chronicle index; Collation of ancient books;