

Research on Intelligent Chinese Character-making without Library Based on Topology and Statistics

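【Author】 Lu Jianping (卢建平)

【Advisor】 Pi Youguo (皮佑国)

【Author information】 South China University of Technology, Control Theory and Control Engineering, 2010, Ph.D.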

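【Abstract (translated from the Chinese)】 Theoretical research on library-free intelligent Chinese character making, described from cultural and technical perspectives, has already produced substantial results, and the software system designed and developed for it has successfully carried out intelligent character-making experiments on the 70244 Chinese characters specified in the character set GB18030-2005. To study the inherent regularity of intelligent character making, this thesis uses mathematical tools such as topology and statistics to symbolize Chinese character prototypes (基元), structures, and codes, and examines the rationality, rigor, and stability of the prototype theory, structure theory, coding theory, and character-making theory, thereby enriching and refining the theory of intelligent character making. To test the effectiveness of intelligent character making, the thesis studies its entropy-reduction mechanism and evaluates its informatization efficiency. The main work and results are as follows:
1. Chinese character prototype theory. (1) Prototypes are described mathematically using topology: the relationships among the set of Chinese characters, the set of character components, and the set of prototypes are analyzed, and a relationship between prototypes and a topological basis is established, which gives mathematical support for the Chinese naming of the prototype. (2) A mathematical description theory of selectable prototypes is established, which solves the problem of choosing prototype sets on different subsets of Chinese characters without mutual conflict, and the determinacy of the prototype set is explained. (3) The analytic hierarchy process (AHP) is then used to build a mathematical model for selecting prototypes from the set of Chinese characters, which solves the practical problem of how to select them. (4) Stability of the number of prototypes: because prototypes have two properties, determinacy of composition and stability in experimental acquisition, a linearizable univariate nonlinear regression from statistical modeling is used to predict the stability of the number of prototypes.
2. Chinese character structure theory. (1) Structures are described mathematically using topology: quotient spaces, homotopy theory, and other tools from modern topology are used to study separately the structure classes with different topological features in intelligent character making, forming a mathematical description theory of character structure and achieving the goal of describing Chinese character structure with topology. (2) Stability of the number of structure types: because structures have two properties, determinacy of definition and stability of the structure types obtained experimentally, the stability of the number of structure types is predicted from the topological properties of the ways characters are assembled.
3. Chinese character coding theory. Given that the code in intelligent character making consists of a structure code and a prototype code, (1) the code is described mathematically and shown to be a combined code with a "structure + prototype" form; (2) the internal code is verified mathematically to be a uniquely decodable code and an instantaneous code. For the experimental result that every one of the 70244 characters in GB18030-2005 has a code on the coding platform and that this code is unique, the coding theory explains, from a mathematical standpoint, the completeness and uniqueness of the internal code.
4. Intelligent character-making theory and system model. The character-making process is described mathematically: (1) the mathematical proposition that characters can be made is proved from the standpoint of topology, which provides the mathematical support for character making; (2) a mathematical model of intelligent character making is built according to its theoretical ideas, which converts the theory from a qualitative description into a mathematical one. The mathematical theory explains why character making is realizable, and the character-making experiments confirm the feasibility and effectiveness of the model proposed in this chapter.
5. Entropy-reduction mechanism and information entropy calculation. Existing Chinese information systems all use a character library and treat the Chinese character as the smallest processing unit; their static average information entropy is 9.65 bits per character, which makes them the text-processing systems with the greatest overhead and the lowest efficiency. After analyzing why library-based systems have high information entropy and how it can be reduced, an entropy experiment was carried out with the prototype as the processing unit. The resulting information entropy is 5.29 bits per character, close to the level of alphabetic scripts, which shows that this scheme effectively reduces the information entropy of Chinese characters.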
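Item 3 of the abstract states that the internal code is uniquely decodable and instantaneous. As a rough illustration only, the sketch below checks the prefix-free property that defines an instantaneous code and evaluates the Kraft sum for a code table. The table `TOY_CODES`, its structure symbols `L` and `T`, and its prototype symbols `a`, `b`, `c` are hypothetical and are not the encoding used in the thesis.

```python
from fractions import Fraction

# Hypothetical toy table, NOT the thesis's encoding: a structure symbol
# (L = left-right, T = top-bottom) is written before its two operands,
# and a lowercase letter stands for one prototype.
TOY_CODES = {
    "木": "a",
    "日": "b",
    "月": "c",
    "林": "Laa",    # left-right(木, 木)
    "明": "Lbc",    # left-right(日, 月)
    "森": "TaLaa",  # top-bottom(木, left-right(木, 木))
}


def is_prefix_free(codewords):
    """Return True if no codeword is a prefix of (or equal to) another one."""
    words = sorted(codewords)
    # In sorted order, a word that is a prefix of any other word is also a
    # prefix of its immediate successor, so checking neighbours is enough.
    return all(not nxt.startswith(cur) for cur, nxt in zip(words, words[1:]))


def kraft_sum(codewords, alphabet_size):
    """Exact Kraft sum: the sum of alphabet_size ** (-len(w)) over all codewords."""
    return sum(Fraction(1, alphabet_size ** len(w)) for w in codewords)


if __name__ == "__main__":
    print(is_prefix_free(TOY_CODES.values()))
    # The toy code uses 5 symbols: a, b, c, L, T.
    print(kraft_sum(TOY_CODES.values(), alphabet_size=5) <= 1)
```

A prefix-free code is always uniquely decodable, and its Kraft sum never exceeds 1. Writing each structure symbol before a fixed number of operands, as in this toy table, is one common way to obtain a prefix-free code; the abstract does not say whether the thesis's internal code is built this way.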

【Abstract】 Research on intelligent Chinese character-making (ICC) without a character library has produced substantial results from both cultural and technical perspectives. The ICC software system has successfully carried out character-making experiments on the 70244 characters specified in the Chinese character set GB18030-2005. To study the inherent regularity of ICC, this thesis uses mathematical tools such as topology and statistics to symbolize Chinese character prototypes, structures, and codes, and examines the rationality, rigor, and stability of the prototype theory, structure theory, coding theory, and ICC theory, thereby enriching and refining the ICC theory. To verify the effectiveness of ICC, the thesis studies its entropy-reducing mechanism and evaluates its informatization efficiency. The main work and results are as follows:
1. Research on Chinese character prototypes. (1) Topology is used to describe the prototypes: the relationships among the sets of Chinese characters, components, and prototypes are analyzed, and a relationship between prototypes and a topological basis is established, which gives mathematical support for the Chinese naming of the prototypes. (2) A mathematical theory of selectable prototypes is established, which solves the problem of choosing prototype sets on different subsets of Chinese characters without conflicts among them. (3) A mathematical model for choosing prototypes from the set of characters is then built with the Analytic Hierarchy Process (AHP), which solves the practical problem of how to choose them. (4) Stability of the prototypes: given the determinate composition of the prototypes and the asymptotic stability of the prototypes obtained in experiments, the exponential smoothing method from statistical modeling is used to predict their stability. The stability of the number of prototypes is also predicted with a nonlinear regression that can be linearized.
2. Research on Chinese character structure theory. (1) Topology is used to describe the structures: quotient spaces and homotopy from modern topology are used to study the structure classes with different topological features in ICC, forming a mathematical descriptive theory of character structure and achieving the goal of describing Chinese character structure with topology. (2) Stability of the structures: given the determinate definition of the structures and the stability of the structures obtained in experiments, the stability of the structure types is predicted from the topological properties of the ways characters are joined together.
3. Research on Chinese character coding theory. Given that the ICC code consists of a structure code and a prototype code, (1) the ICC code is shown mathematically to be a combined code of the form "structure plus prototype"; (2) the internal code of ICC characters is verified mathematically to be a uniquely decodable code and an instantaneous code. For the coding experiment in which all 70244 characters of GB18030-2005 have their own codes on the coding platform and these codes are unique, the coding theory explains the completeness and uniqueness of the ICC internal code.
4. Research on the ICC theory and the system model. The character-making process is described mathematically. First, the mathematical proposition that characters can be made is proved from the standpoint of topology, which provides the mathematical support for Chinese character-making. Second, a mathematical model of ICC is built according to the ICC theory, which converts the character-making theory from a qualitative description into a mathematical one. The mathematical theory explains why character-making is realizable, and the character-making experiments confirm the feasibility and effectiveness of the model proposed in this chapter.
5. Research on the entropy-reducing mechanism of ICC. Existing Chinese information systems all use a character library, in which the Chinese character is the smallest processing unit and the static average information entropy is 9.65 bits per character, which makes them the text-processing systems with the highest overhead and the lowest efficiency. After analyzing why library-based systems have high information entropy and how it can be reduced, an entropy experiment was carried out with the prototype as the processing unit. It gives an information entropy of 5.29 bits per character, close to the level of alphabetic writing systems, which indicates that this scheme effectively reduces the information entropy of Chinese characters.
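The entropy figures in item 5 depend on which unit is counted as a symbol. As a minimal sketch, assuming a zeroth-order (symbol-frequency) entropy, the example below computes bits per symbol for the same toy text at the character level and at the prototype level. The table `TOY_DECOMPOSITION` and the sample text are hypothetical illustrations, not the thesis's prototype set or corpus, so the numbers it produces are not comparable to 9.65 or 5.29.

```python
import math
from collections import Counter

# Hypothetical decomposition of a few characters into prototypes.
# This is an illustrative assumption, not the prototype set of the thesis.
TOY_DECOMPOSITION = {
    "木": ["木"],
    "日": ["日"],
    "月": ["月"],
    "林": ["木", "木"],
    "森": ["木", "木", "木"],
    "明": ["日", "月"],
}


def entropy_bits(symbols):
    """Zeroth-order Shannon entropy of a symbol sequence, in bits per symbol."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def to_prototypes(text):
    """Replace each character by its prototypes from the toy table."""
    return [p for ch in text for p in TOY_DECOMPOSITION[ch]]


text = "木林森明日月"  # toy sample, not a corpus
char_entropy = entropy_bits(text)                  # unit = character
proto_entropy = entropy_bits(to_prototypes(text))  # unit = prototype

if __name__ == "__main__":
    print(f"character level: {char_entropy:.3f} bits/symbol")
    print(f"prototype level: {proto_entropy:.3f} bits/symbol")
```

In this toy text the six characters are equally frequent, while the prototype stream uses only three symbols with skewed frequencies, so its per-symbol entropy is lower. The prototype stream is also longer, which a per-symbol figure alone does not capture.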
