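Design and Implementation of a Printed Mathematical Formula Recognition System

【Author】 Hou Lichang (侯利昌)

【Advisor】 Wu Wei (吴微)

【Author information】 Dalian University of Technology, Computational Mathematics, 2004, Master's thesis

【Subtitle】 Segmentation, Recognition, and Reconstruction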

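【Abstract (translated from the Chinese)】 With the spread of computers, people increasingly use them to handle daily work and store information. OCR systems in wide use today achieve high recognition rates on both handwritten and printed text; they have been widely applied in office automation, rapid data entry, and similar areas, removing the time and labor cost of manual input. A scientific document, however, contains many mathematical formulas, which are complex structures composed of special symbols, Greek letters, English characters, and digits. Current OCR systems recognize only individual characters and cannot analyze formula structure, so a recognized formula is merely a string of unrelated characters that has lost its mathematical meaning. We therefore propose a new design for expression recognition and give a complete algorithm that converts printed mathematical formulas (in image form) into an editable electronic format (such as LATEX or the Word equation editor). Following the pipeline of the expression recognition system, this thesis is divided into four corresponding parts. (1) Segmentation of touching characters. Owing to print quality, paper smoothness, scanner resolution, binarization, and other factors, characters in a scanned image may touch, which makes character recognition difficult. This thesis proposes using a self-organizing map for character segmentation, with some modifications to the classic self-organizing learning rule so that it approximates the distribution of the white pixels of the touching characters with fewer neurons and at higher speed. The shortest-path segmentation method is compared with the self-organizing-map method; the latter can segment some touching characters that the former cannot handle. (2) Feature extraction and selection. A character image is only a point in pattern space and cannot yet be used for classification; geometrically invariant features that are robust to rotation, scaling, and translation must be extracted from it. Three commonly used moment methods are introduced: regular moments, Zernike moments, and spline wavelet moments. Computing a separability measure for the three shows that Zernike moments are better suited as character features. A neural-network-based principal component analysis method is also introduced; it selects 18 principal features from the 38-dimensional moment features, which preserves information while greatly reducing the dimension of the feature vector, removing correlation among samples, and accentuating their differences. (3) Character recognition. The classifier is the core of the whole recognition system. Neural networks have been widely used in pattern recognition; they overcome shortcomings of the commonly used pattern recognition methods and effectively raise the recognition rate. A self-organizing feature map performs coarse classification, placing characters with similar features in one group. BP neural networks then perform fine classification within each group, distinguishing the different characters of the same group, which effectively improves classification accuracy. (4) Formula reconstruction. How to determine the complex structure of a set of characters still has no good solution. A new formula reconstruction method is introduced; it mainly comprises a method for locating superscripts and subscripts, construction rules for mathematical expressions that conform to an LL(1) grammar, and a parser. The parser turns the unordered character string into a syntax tree, which is finally converted into an editable LATEX formula. Finally, experiments on a number of English-language mathematical documents show that the system has some practical value, although it still needs further improvement.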
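The formula reconstruction stage above relies on an LL(1) grammar and a parser that emits a syntax tree, which is then converted to LATEX. The abstract does not give the actual rules, so the following is only a minimal sketch under stated assumptions: the toy grammar, the token conventions (`^` and `_` standing for superscript and subscript relations already found by a layout step), and the function names `parse` and `to_latex` are all hypothetical.

```python
# Hypothetical toy grammar (an assumption, not the rule set of the thesis):
#   expr   -> term (('+' | '-') term)*
#   term   -> factor ('/' factor)*
#   factor -> atom (('^' | '_') atom)*
#   atom   -> SYMBOL | '(' expr ')'
# Every decision is made from a single lookahead token, as in LL(1) parsing.

def parse(tokens):
    """Parse a token list by recursive descent and return a syntax tree."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        pos += 1
        return tok

    def expr():
        node = term()
        while peek() in ("+", "-"):
            op = take()
            node = (op, node, term())
        return node

    def term():
        node = factor()
        while peek() == "/":
            take()
            node = ("/", node, factor())
        return node

    def factor():
        node = atom()
        while peek() in ("^", "_"):
            op = take()
            node = (op, node, atom())
        return node

    def atom():
        if peek() == "(":
            take("(")
            node = expr()
            take(")")
            return ("group", node)
        tok = take()
        if tok in ("+", "-", "/", "^", "_", ")"):
            raise SyntaxError(f"unexpected {tok!r}")
        return tok

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"trailing token {peek()!r}")
    return tree


def to_latex(node):
    """Convert the syntax tree produced by parse() into a LaTeX string."""
    if isinstance(node, str):          # leaf: a recognized symbol
        return node
    if node[0] == "group":             # parenthesized sub-expression
        return "(" + to_latex(node[1]) + ")"
    op, left, right = node
    if op == "/":
        return r"\frac{" + to_latex(left) + "}{" + to_latex(right) + "}"
    if op in ("^", "_"):
        return to_latex(left) + op + "{" + to_latex(right) + "}"
    return to_latex(left) + " " + op + " " + to_latex(right)
```

For example, `to_latex(parse(["x", "_", "i", "^", "2", "+", "a", "/", "b"]))` returns `x_{i}^{2} + \frac{a}{b}`. A real system would need a much richer grammar (roots, sums, integrals, matrices) and a layout analysis step to produce the tokens.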

【Abstract】 Computerized document-handling systems are widely used, but few can recognize and understand the mathematical expressions printed in a document. The system proposed in this thesis recognizes mathematical expressions in pages scanned directly from paper and reconstructs them in a publication format such as LATEX or Word. The system works in four stages. (1) Merged-symbol segmentation. Because of print quality, paper cleanliness, scanner resolution, binarization, and similar factors, symbols in a scanned document may be merged and therefore cannot be recognized easily. We propose using a self-organizing feature map to segment merged symbols. By modifying the classic update rule of the self-organizing map, we obtain a network that approximates the distribution of white pixels between two symbols in less training time and with fewer units. (2) Feature extraction and selection. A symbol image cannot be classified directly, because it is not invariant to translation, rotation, or scaling. We investigate three kinds of moment features used as shape descriptors: regular moments, Zernike moments, and B-spline wavelet moments. We also use a PCA neural network to select principal features, which reduces the dimension of the feature space while retaining useful information. (3) Character recognition. The recognizer is the key part of the system. Neural networks, which overcome disadvantages of traditional pattern recognition methods, have been used extensively in OCR and have achieved higher recognition rates. We use an SOFM network as a coarse classifier that places similar symbols in the same group, and then a BP network as a fine classifier that identifies the symbols within each group. (4) Expression formation. The problem of understanding complicated mathematical expressions in a printed document has not yet been completely solved. We introduce a formation algorithm that locates superscripts and subscripts and analyzes the two-dimensional layout of the symbols within an expression. The structure of a recognized expression is then represented as a tree, and the original expression can be reproduced with a suitable formatter such as LATEX. The experimental results at the end of the thesis demonstrate the feasibility of the system, but the proposed model still needs further improvement before commercial application.
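The segmentation stage starts from the classic update rule of the self-organizing map. The sketch below illustrates that classic (Kohonen) rule only; it is not the modified rule of the thesis, which the abstract does not specify. A one-dimensional chain of units is fitted to a set of 2-D points, which could stand for pixel coordinates. The function name `train_som`, the linear decay schedules, and all default parameters are assumptions chosen for illustration.

```python
import math
import random


def train_som(points, n_units=5, epochs=50, lr0=0.5, sigma0=2.0, seed=0):
    """Fit a 1-D chain of units to 2-D points with the classic Kohonen rule."""
    rng = random.Random(seed)
    # Initialise each unit at a randomly chosen data point.
    units = [list(rng.choice(points)) for _ in range(n_units)]
    total_steps = epochs * len(points)
    step = 0
    for _ in range(epochs):
        order = points[:]
        rng.shuffle(order)
        for x, y in order:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)               # linearly decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 0.01  # shrinking neighbourhood width
            # Best-matching unit: the unit closest to the sample.
            bmu = min(
                range(n_units),
                key=lambda k: (units[k][0] - x) ** 2 + (units[k][1] - y) ** 2,
            )
            for k in range(n_units):
                # Gaussian neighbourhood over the distance along the chain.
                h = math.exp(-((k - bmu) ** 2) / (2.0 * sigma ** 2))
                # Move unit k towards the sample, scaled by lr and h.
                units[k][0] += lr * h * (x - units[k][0])
                units[k][1] += lr * h * (y - units[k][1])
            step += 1
    return units
```

Calling `train_som(pts)` on, say, the coordinates of a narrow vertical strip of pixels returns five unit positions that stay inside the bounding box of the data, because each update moves a unit part of the way towards a sample. How such units would be turned into a cut path between two symbols is left open here.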

  • 【Classification code】 TP391.4
  • 【Citations】 16
  • 【Downloads】 246