
Research of Layout Analysis on Complex Chinese Document Images (复杂的中文文档图像版面分析研究)

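【Author】 Dang Xing (党兴)

【Advisor】 Gong Shengrong (龚声蓉)

【Author information】 Soochow University (苏州大学), Computer Application Technology, 2010, Master's thesis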

【Abstract】 Optical character recognition (OCR) is a fast, labor-saving way to input text automatically and is widely used in building online resource databases and digital libraries. As the first step in automating OCR, layout analysis directly affects the semantic and logical relations in the system's output. Chinese document images are harder to analyze than Western-language ones because of their complex backgrounds and typesetting, so research on Chinese layout analysis has both theoretical significance and practical value. Existing algorithms for skew detection, page segmentation, and plain-text layout analysis are easily affected by layout complexity. Based on the characteristics of Chinese layouts, this thesis studies layout analysis algorithms for Chinese document images in depth, supported by extensive experiments, and reports the following results:

1. When existing nearest-neighbor methods compute the skew angle of a document image, the selected nearest-neighbor pairs may be wrong, so the computed angle can differ substantially from the true one. The proposed improved nearest-neighbor chain method determines whether similar connected components lie in the same row or the same column and builds two kinds of similar k-nearest-neighbor chains accordingly. This keeps erroneous nearest-neighbor chains from interfering with the angle computation and improves the accuracy of the skew estimate.

2. The traditional run-length smoothing algorithm (RLSA) is sensitive to the choice of smoothing threshold. The thesis proposes a run-length smoothing algorithm based on selective connected components, which selects thresholds according to the sizes and distances of connected components within and between regions. It overcomes the traditional algorithm's dependence on font size, character spacing, and image regions, and markedly improves page segmentation for document images with a single background.

3. Existing segmentation algorithms for color document images with complex backgrounds generally face a conflict between running time and segmentation accuracy. By improving the color-to-gray conversion and using a dynamic clustering segmentation method based on edge images, the thesis avoids losing the color information of text regions during grayscale conversion and processes only the edge image. Page segmentation becomes faster without a loss of segmentation accuracy.

4. Existing layout analysis algorithms for complex plain-text images with unknown reading order are sensitive to parameter selection and have weak applicability. The thesis proposes a layout analysis algorithm based on SVM region construction. The algorithm selects seed connected components as the first feature and builds regions step by step, then uses the projection method to determine the reading order within each region. Experimental results show that the proposed method adapts better than existing methods and gives satisfactory analysis results on complex Chinese layouts.
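For orientation, the traditional run-length smoothing baseline that the second contribution improves on can be sketched as follows. This is a minimal illustration of the classic technique only, not the selective connected-component variant proposed in the thesis. The function names, the 0/1 image encoding (1 = foreground), and the choice to fill only gaps bounded by foreground on both sides are assumptions made for this example.

```python
def smooth_row(row, threshold):
    """Fill background runs (0s) of length <= threshold that lie
    between two foreground pixels (1s). Returns a new list."""
    out = list(row)
    last_fg = None  # index of the most recent foreground pixel
    for i, px in enumerate(row):
        if px == 1:
            if last_fg is not None:
                gap = i - last_fg - 1
                if 0 < gap <= threshold:
                    for j in range(last_fg + 1, i):
                        out[j] = 1
            last_fg = i
    return out


def rlsa_horizontal(image, threshold):
    """Apply row-wise smoothing to every row of a binary image."""
    return [smooth_row(row, threshold) for row in image]


def rlsa_vertical(image, threshold):
    """Apply the same smoothing column-wise by transposing twice."""
    columns = [list(col) for col in zip(*image)]
    smoothed = [smooth_row(col, threshold) for col in columns]
    return [list(row) for row in zip(*smoothed)]


def rlsa(image, h_threshold, v_threshold):
    """Combine horizontal and vertical smoothing with a pixel-wise AND."""
    horizontal = rlsa_horizontal(image, h_threshold)
    vertical = rlsa_vertical(image, v_threshold)
    return [[a & b for a, b in zip(h_row, v_row)]
            for h_row, v_row in zip(horizontal, vertical)]
```

The result depends entirely on the two fixed thresholds: a value that merges characters into lines at one font size can merge separate blocks at another, which is the sensitivity the abstract describes.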

  • 【Online publication contributor】 Soochow University (苏州大学)
  • 【Online publication year and issue】 2011, Issue 01