汉字字形形式化描述方法及应用研究

The Study of Formal Description of Chinese Character Glyph and Application

【Author】 Lin Min (林民)

【Supervisor】 Song Rou (宋柔)

【Author information】 Beijing University of Technology (北京工业大学), Computer Application Technology, 2009, doctoral thesis

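【Summary】 In Chinese character information processing, existing methods for formally describing character glyphs are mostly based on the structural analysis method that philology and Chinese language teaching research use to describe the form and structure of characters. They describe glyphs hierarchically in terms of units of human cognition such as structure types, components and strokes. These methods suffer from ambiguity and gaps in description with respect to glyph decomposition rules, structure type classification and the choice of description primitives. They cannot describe all kinds of characters (including miswritten characters, variant characters in ancient texts, and folk composite characters) in a uniform way, nor do they support automatic glyph comparison. They therefore cannot serve applications built on glyph comparison and analysis, such as describing miswritten characters and quantitatively analysing errors in teaching research, describing and comparing glyphs in ancient texts, and retrieving rare glyphs in digital books.

Character recognition models based on statistical machine learning cannot classify special characters such as miswritten, variant and composite characters, because samples of these characters cannot be collected in advance and no training data exists. For ordinary characters whose training samples can be collected, the statistical glyph features used in recognition models are hard to interpret logically and to map to the structure types, components and strokes that people perceive. Such a model is a "black-box" glyph description and cannot support human-oriented glyph comparison and analysis.

These problems come down to the lack of a unified, effective method for formal glyph description and glyph comparison for Chinese characters. This thesis addresses that core problem: oriented toward glyph comparison and analysis applications, it establishes a glyph description method together with a set of related glyph comparison algorithms and practical tools. The main contributions are:

1) A stroke-segment-mesh method for the formal description of glyphs is proposed. It uses straight line segments of predefined length and direction, called stroke segments, as the primitives for describing a glyph. The primitives have suitable granularity, are standardized and unambiguous, and can uniformly describe the similarities and differences between the glyph skeletons of all possible modern-script characters (今文字), including miswritten, variant and composite characters. Validation experiments show that, compared with a dot-matrix glyph with the same number of primitives, this method needs fewer effective primitives to describe the same character and makes glyph comparison more efficient. Glyphs of different characters are well separated, which helps the accuracy and reliability of comparison and gives a good performance-to-cost ratio.

2) Based on the stroke-segment-mesh description, a set of glyph comparison algorithms is proposed. The stroke-segment-context algorithm uses the stroke segment as the comparison unit. Tests on the GB2312 character set and on some miswritten and variant characters show that the algorithm compares glyph similarity without training, that its results are little affected by structure type or stroke division, and that accuracy can reach 100% when the input glyph and the compared glyph use the same mesh size. The stroke-segment-combination algorithm automatically extracts simple and compound strokes from the mesh description. It can compare glyphs by simple strokes alone, or adaptively by compound and simple strokes. Experiments on the same test set show that the algorithms based on simple and compound strokes compute glyph similarity without training, are less sensitive to changes in the overall size of the input glyph and to different deformations of slanted strokes, and reach very high accuracy, up to 100%, on structurally regular glyphs drawn according to the constraints. Because the comparison units are larger, comparison is efficient and scales to comparing and searching large sets of glyphs. The units map easily to the character-forming units that people perceive, so this is a "white-box" similarity computation. It applies to both whole-glyph and partial-glyph comparison, can detect how irregular glyphs with badly distorted structural proportions differ from regular ones, and suits glyph-analysis applications. In addition, a method for describing and computing structural relations based on a stroke relation matrix is established, which can support automatic identification of structure types.

3) Given the importance of components in the study of character form and structure, the thesis proposes a component description method that attaches combination-relation annotations to the simple strokes of the stroke-segment-mesh description, together with an automatic component discovery algorithm. Experiments show that the algorithm finds characters containing a given component very accurately, regardless of the component's position and size in the glyph.

4) The character structure description system of the 《汉字信息字典》 (Dictionary of Chinese Character Information) is improved, and a glyph similarity algorithm based on structure description is proposed. Experiments show that the similar characters it finds are consistent in structure type and agree well (above 96%) with human judgments of similarity. The algorithm suits similarity computation for characters whose structure type is unambiguous.

5) Finally, a practical software system is designed and implemented: a tool for Chinese character glyph description and automatic comparison and analysis. It builds stroke-segment-mesh descriptions through a handwriting-style drawing method that ordinary users can follow. It accepts any conceivable character, including miswritten, variant and composite characters, along with related information, and automatically converts a stroke-segment-mesh glyph into a corresponding TrueType glyph so that it can be processed like characters in a standard character set. It automatically compares stroke-segment-mesh glyphs in whole or in part and returns similar characters ranked by similarity. With this tool, the 20902 characters of the GBK character set and miswritten characters produced by international students at Beijing Language and Culture University (北京语言大学) were described, and the resulting glyph database was applied to error analysis of miswritten characters in Chinese character teaching.

This work benefits the standardization of Chinese glyph description and has application prospects in fields that rely on glyph computation, such as the input of characters outside standard character sets, the construction of digital libraries in China, Chinese teaching research and international promotion, research on the cultural history of Chinese characters, and the informatization of social administration.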

【Abstract】 In Chinese character information processing, current approaches to the formal description of character glyphs are mostly based on the structural analysis method used in research on Chinese characters and in the teaching of Chinese to describe the form and structure of characters. They describe glyphs hierarchically in terms of units of human perception, namely glyph formation units such as structure types, components and strokes. These methods lead to ambiguities and gaps in description with regard to glyph decomposition, structure classification and the selection of descriptive elements. They therefore cannot describe every possible glyph skeleton (including wrongly written characters, variant forms of characters in ancient literature, and combined characters), nor can they support automatic glyph comparison, let alone meet practical needs that depend on glyph comparison and analysis, such as describing wrongly written characters and quantitatively analysing character errors in teaching and research, describing and analysing variant forms of characters in ancient literature, or retrieving rare glyphs in electronic books.

For special characters whose glyph samples cannot be collected in advance, such as wrongly written characters, variant forms in ancient literature, and combined characters, no training is possible, so recognition models cannot support comparative computation on them or guarantee that they are recognized. The statistical glyph features adopted by recognition models are also difficult to interpret logically and to map to the structure types, components and strokes of human cognition. Such models are black boxes and do not meet the demand for human-oriented comparison and analysis of different types of glyph.

To address the core issue, the lack of a unified and effective means of formal description and automatic comparison of Chinese character glyphs, this thesis, oriented toward applications of glyph comparison and analysis, offers a new approach to describing glyphs and provides a set of related glyph comparison algorithms and some practical tools. The main contributions are:

1) A method is offered for formally describing Chinese characters with a stroke-segment mesh, which uses a line segment of predefined length and direction (a stroke segment) as the glyph description element. Because the element has suitable granularity and is standardized and free of ambiguity, the method can describe the glyph skeleton of all Chinese characters (including wrongly written characters, variant forms of characters in ancient literature, and combined characters). Experiments show that, compared with a dot-matrix glyph that has the same number of elements, the stroke-segment-mesh description needs far fewer effective elements and achieves higher efficiency. Moreover, the accuracy and reliability of computation improve because the stroke-segment-mesh glyphs of different characters are more clearly distinguishable.

2) Based on the stroke-segment-mesh description, a set of glyph comparison algorithms is presented. The algorithm that compares glyphs by stroke segment and its context uses the stroke segment as the comparison unit. Experiments on the GB2312 character set and on some wrongly written characters, variant forms of characters, and combined characters show that its similarity results are little affected by factors such as character structure type and stroke division. The algorithm compares glyphs without training and has a high accuracy rate when the input character is basically the same size as the compared one. The algorithm that compares glyphs by combinations of stroke segments automatically extracts simple strokes and compound strokes from the stroke-segment mesh. As its comparison unit it uses either simple strokes alone, or compound and simple strokes adaptively. Experiments on the same character set show that the algorithms based on simple and compound strokes also compute glyph similarity without training, and that the result is less sensitive to glyph size and to different deformations of inclined strokes. The algorithms achieve a high accuracy rate (nearly 100%) for the first candidate when the input glyphs have a normal structure. They use a larger comparison unit and can be applied efficiently to large-scale glyph searching. The comparison unit maps easily to the units of human cognition, so this is a "white-box" approach to glyph similarity computation. The method can be applied to an entire character or to part of it. It can find the differences between characters of non-standard structure and characters of standard structure, and therefore meets the needs of glyph-analysis applications. A method for describing and computing structural relationships, based on a relationship matrix of strokes, is also provided; it can be used for the automatic identification of the structure types of Chinese characters.

3) Given the importance of components in research on the form and structure of Chinese characters, a component description method, which attaches annotations to the simple strokes of the stroke-segment-mesh glyph, and an algorithm for automatically detecting components are proposed. Experiments show that the algorithm accurately detects the characters that contain a specific component, regardless of the location and size of the component in the glyph.

4) The thesis also improves the character structure description system of the "Chinese character information dictionary" and offers an algorithm for calculating the glyph similarity of Chinese characters based on structure description. The experimental results show that the lists of similar characters found by this algorithm are highly consistent in structure and conform to human cognition. The algorithm is therefore suitable for similarity calculation on characters of definite structure class.

5) An application software system, the Toolkit of Chinese Character Glyph Description and Automatic Comparison and Analysis, is designed and implemented. The tool creates a stroke-segment-mesh glyph description through a familiar handwriting and drawing method. Any imaginable Chinese character can be entered, including wrongly written characters, variant forms of characters in ancient literature, and combined characters, together with other related information. The stroke-segment-mesh glyph can be converted automatically to a corresponding TrueType glyph and processed just like the characters in a standard character set. The tool can compare stroke-segment-mesh glyphs in whole or in part to find their similarities and differences, and can produce a list of similar characters sorted by similarity. With this tool, stroke-segment-mesh glyph descriptions have been created for the 20,902 Chinese characters of the GBK character set and for wrongly written characters produced by foreign students studying at Beijing Language and Culture University. The resulting glyph database has been applied to the analysis of character-writing errors made by foreign students.

The work will benefit the standardization of Chinese character glyph description and will find wide application in fields based on glyph computing, such as the input of Chinese characters outside the standard character set, the construction of digital libraries in China, the research, teaching and international promotion of Chinese, research into the history of Chinese characters and culture, and the informatization of social management.
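
The abstract does not give the encoding of a stroke-segment mesh or the formula used for comparison, so the following is only a minimal Python sketch of the general idea under stated assumptions. The `(column, row, direction)` encoding, the four direction codes, the toy meshes for 十 and 土, and the use of Jaccard set overlap as the similarity score are all hypothetical and are not taken from the thesis. What the sketch does share with the described approach is that glyphs are compared without any training and that the result can be traced back to individual segments.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

# Hypothetical encoding (an assumption, not the thesis's format): a stroke
# segment is (column, row, direction) on a mesh, with four assumed directions.
Segment = Tuple[int, int, str]
Glyph = FrozenSet[Segment]
DIRECTIONS = {"H", "V", "D", "A"}  # horizontal, vertical, diagonal, anti-diagonal


def make_glyph(segments: Iterable[Segment]) -> Glyph:
    """Build a glyph as a set of stroke segments, rejecting unknown directions."""
    glyph = frozenset(segments)
    for _, _, direction in glyph:
        if direction not in DIRECTIONS:
            raise ValueError(f"unknown direction: {direction!r}")
    return glyph


def similarity(a: Glyph, b: Glyph) -> float:
    """Jaccard overlap of two segment sets (a stand-in for the thesis's measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def rank_by_similarity(query: Glyph, library: Dict[str, Glyph]) -> List[Tuple[str, float]]:
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, similarity(query, glyph)) for name, glyph in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy meshes: 十 is one horizontal and one vertical stroke, each two segments long;
# 土 adds a second horizontal stroke below.
SHI = make_glyph([(0, 1, "H"), (1, 1, "H"), (1, 0, "V"), (1, 1, "V")])
TU = make_glyph(list(SHI) + [(0, 2, "H"), (1, 2, "H")])

if __name__ == "__main__":
    for name, score in rank_by_similarity(SHI, {"十": SHI, "土": TU}):
        print(name, round(score, 3))
```

Because each glyph is a plain set of segments, the segments that two glyphs share or lack can be listed directly (`a & b`, `a - b`), which is the kind of inspectable result the abstract contrasts with statistical "black-box" features. A set overlap of this kind is sensitive to position and size, so it corresponds at most to the case where both glyphs use the same mesh size.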
