

Research on Business Card Recognition System Based on OCR Technology

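【Author】 武玉坤

【Advisors】 罗晓奔; 张桂平

【Author information】 长沙理工大学 (Changsha University of Science and Technology), Computer Application Technology, 2008, Master's thesis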

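【Abstract, translated from the Chinese】 In practical business and economic activity, the business card has become an important carrier of identity information. By language, business cards fall roughly into two classes: cards with mixed bilingual text and monolingual cards. Recognition of mixed Chinese/English text is a printed-character recognition problem in urgent need of a solution, and current business card layout analysis algorithms are too computationally expensive to be practical. This thesis investigates several new techniques to advance the solution of these problems. Its main work is as follows: (1) It explains why business card recognition systems are needed, gives the overall framework of a typical business card recognition system, and analyzes the general characteristics of business cards. (2) It proposes a layout analysis method based on mathematical morphology, which uses morphological dilation together with a search algorithm to analyze complex business card layouts quickly and accurately. (3) Existing character segmentation methods cannot segment accurately when Chinese and English text are mixed or when text of different font sizes is mixed. To address this, the thesis proposes a mixed-text segmentation method based on the Chinese character period and recognition feedback. The method uses a Chinese character separation algorithm based on the period of the character spacing to determine the type of each connected region, and then applies a recognition-based Chinese character component merging algorithm to merge the connected regions of left-right structured characters. Experiments show that the character segmentation accuracy of this method is better than that of the traditional projection-based line and character segmentation algorithm. (4) Building on an information classification algorithm based on heuristic rules, it uses the layout position of text in the business card image to assist classification. (5) It reports experiments on the new algorithms adopted in the thesis to verify their performance.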
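The layout analysis idea in item (2), dilation followed by a search for connected regions, can be illustrated with a minimal sketch. This is not the thesis's algorithm, whose details the abstract does not give. It assumes a binarized card image stored as a list of rows of 0/1 values, a purely horizontal structuring element, and breadth-first search as the search step. The names `dilate_horizontal`, `find_blocks`, `text_blocks` and the parameter `radius` are hypothetical.

```python
from collections import deque

def dilate_horizontal(img, radius):
    """Binary dilation with a 1 x (2*radius + 1) horizontal structuring element.
    Assumption: smearing along the row is enough to join characters of one text line."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x]:
                for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                    out[y][nx] = 1
    return out

def find_blocks(img):
    """Find 4-connected foreground regions by breadth-first search.
    Returns bounding boxes (top, left, bottom, right), inclusive, in scan order."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not img[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and img[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

def text_blocks(img, radius=2):
    """Candidate text blocks: bounding boxes of the dilated image's regions.
    Note that each box is widened by up to `radius` pixels on both sides."""
    return find_blocks(dilate_horizontal(img, radius))
```

On a toy image with three isolated pixels on one row and two on another, `find_blocks` alone reports five regions, while `text_blocks` with `radius=1` reports two, one per row. A real card would need a radius tuned to the character spacing, and probably a vertical component as well.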

【Abstract】 In everyday commercial and economic activity, the business card has become an important carrier of identity information. By language, business cards fall roughly into two kinds: bilingual and monolingual. Recognizing mixed Chinese and English text is an open problem in printed-character recognition, and existing business card layout analysis algorithms are too computationally expensive to be practical. This thesis studies several new techniques to address these problems. Its contributions are as follows: (1) It explains why a business card recognition system is needed, gives the overall framework of a typical system, and analyzes the general characteristics of business cards. (2) It presents a layout analysis method for business cards based on mathematical morphology. Using morphological operations and a search algorithm, the method analyzes complex business card layouts quickly and accurately. (3) Existing character segmentation methods do not segment accurately when Chinese and English text or different font sizes are mixed. The thesis therefore introduces a segmentation method for mixed Chinese/English text based on character period and recognition feedback. The method uses a Chinese character separation algorithm based on the character-spacing period to determine the type of each connected region, then merges the connected regions of Chinese characters with a recognition-based component merging algorithm. Experiments show that its character segmentation accuracy is better than that of the traditional projection-based line and character segmentation algorithm. (4) Building on a heuristic rule-based information classification algorithm, it uses layout information in the card image to improve automatic categorization of the text on business cards. (5) It reports experiments on the new algorithms to validate system performance.
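To make the segmentation problem in item (3) concrete, the sketch below splits a text line by vertical projection and then merges neighbouring pieces that together fit within one character period. It is a simplified stand-in, not the thesis's method. The abstract says the merge decision uses recognition feedback, whereas this sketch substitutes a plain width test, and it assumes the period (`pitch`) is already known. The names `column_runs`, `merge_by_pitch`, `pitch` and `tolerance` are hypothetical.

```python
def column_runs(line_img):
    """Split a binarized text-line image (list of rows of 0/1) into runs of
    consecutive non-empty columns. Returns (start, end) pairs, end exclusive."""
    width = len(line_img[0])
    # Vertical projection: number of foreground pixels in each column.
    profile = [sum(row[x] for row in line_img) for x in range(width)]
    runs, start = [], None
    for x, count in enumerate(profile):
        if count and start is None:
            start = x
        elif not count and start is not None:
            runs.append((start, x))
            start = None
    if start is not None:
        runs.append((start, width))
    return runs

def merge_by_pitch(runs, pitch, tolerance=0.2):
    """Greedily merge a run into the previous one while the merged width stays
    within pitch * (1 + tolerance). Assumption: a width test replaces the
    recognizer that the thesis uses to confirm a merge."""
    limit = pitch * (1 + tolerance)
    merged = []
    for start, end in runs:
        if merged and end - merged[-1][0] <= limit:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged
```

Projection alone breaks a left-right structured character into two runs wherever an empty column separates its components. The width test rejoins them, but it would also join two narrow Latin letters, which is the kind of error a recognition-based check is meant to prevent.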

【Keywords】 business card recognition (BCR); OCR; character segmentation; information classification
  • 【Classification number】 TP391.41
  • 【Times cited】 10
  • 【Downloads】 820

