基于OCR的调查问卷自动识别统计分析系统的开发与设计
Development and Design of Questionnaire Automated Statistical Analysis System Based on OCR
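【Author】 Dong Shichao (董世超);
【Advisor】 Niu Lianqiang (牛连强);
【Author information】 Shenyang University of Technology (沈阳工业大学), Computer Application Technology, 2012, Master's degree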
【Abstract】 At present, most questionnaire data are still tallied and analyzed by hand. With the rapid development of computer technology, using computers to recognize questionnaire images and analyze the results statistically has become an inevitable trend. Dedicated OCR-based software systems already exist for mail sorting, bank bill analysis, and ballot counting, but they are tied to fixed layouts and generalize poorly, whereas the format and content of questionnaires are not fixed, so automatic recognition of questionnaires remains problematic. Visualization of the recognized data, in particular, has not been studied in depth. This thesis takes the questionnaire as its subject and focuses on questionnaire recognition and tallying techniques, including the definition of the questionnaire layout structure, the selection of recognition regions, and visual display. A user-defined questionnaire description file, combined with information inherent in the questionnaire, drives automatic recognition and tallying, and the recognized data are then displayed visually. Scanned images must be deskewed before recognition, so the thesis proposes defining special points in the questionnaire description file and correcting skew by pattern matching on those points. For images where this matching deviates or fails, skew is corrected instead from connected regions, using the fact that questionnaire text lines are long and evenly spaced.
To obtain the content to be recognized, XML is used as a bridge, chosen for its platform independence, hierarchical structure, and extensibility. Questionnaire information is converted from hierarchical, semi-structured XML data into relational data; the mapping relies mainly on the concept of related node sets and is completed by mapping nodes directly. Handwritten content in the questionnaire is recognized using crossing features and hole features of the characters. After recognition, the multidimensional data are visualized in parallel coordinates. Because questionnaire data shown in parallel coordinates have a high repetition rate, a random perturbation formula is given to spread out the repeated values, and cluster analysis then divides the records into groups. Brushing is used to display the different groups. Based on this work, a preliminary OCR-based questionnaire recognition and statistical analysis system has been implemented.
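The abstract refers to a random perturbation formula for spreading out repeated values in the parallel-coordinates display, but it does not state the formula. The sketch below shows one common way this could be done, assuming a uniform random offset added to each coded answer. The function name `jitter_records`, the `amplitude` value of 0.3, and the sample records are hypothetical and are not taken from the thesis.

```python
import random


def jitter_records(records, amplitude=0.3, seed=0):
    """Add a small uniform random offset to every coded answer.

    Assumption (not from the thesis): answers are integer codes, and the
    offset stays below 0.5 so that rounding recovers the original code.
    """
    rng = random.Random(seed)  # fixed seed keeps the display reproducible
    return [
        tuple(value + rng.uniform(-amplitude, amplitude) for value in record)
        for record in records
    ]


# Hypothetical sample: three respondents gave identical answers, so their
# polylines would overlap exactly in an unperturbed parallel-coordinates plot.
records = [(1, 3, 2), (1, 3, 2), (1, 3, 2), (2, 3, 1)]
jittered = jitter_records(records)

# Rounding each perturbed value gives back the original answer code.
recovered = [tuple(round(v) for v in record) for record in jittered]
```

Keeping the offset below half the spacing between answer codes means the perturbation only affects the drawing: the original codes can still be read back by rounding, and identical records no longer hide one another.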
【Key words】 image processing; automatic statistics; handwritten numeral recognition; visual display;
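The abstract names crossing features and hole features as the basis for handwritten numeral recognition without giving details. As a minimal sketch of what such features might look like, the code below assumes a binarized character image stored as a list of rows of 0 (background) and 1 (stroke). It counts stroke crossings along each scan line and counts enclosed background regions by flood fill. The function names and the tiny sample glyphs are hypothetical, and the thesis may define these features differently.

```python
from collections import deque


def crossings(line):
    """Count background-to-stroke transitions along one scan line."""
    count, previous = 0, 0
    for pixel in line:
        if pixel and not previous:
            count += 1
        previous = pixel
    return count


def row_crossings(image):
    return [crossings(row) for row in image]


def column_crossings(image):
    return [crossings(column) for column in zip(*image)]


def count_holes(image):
    """Count 4-connected background regions fully enclosed by strokes."""
    height, width = len(image), len(image[0])
    # Pad with a background border so all outer background forms one region.
    padded = [[0] * (width + 2)]
    padded += [[0] + list(row) + [0] for row in image]
    padded += [[0] * (width + 2)]
    seen = [[False] * (width + 2) for _ in range(height + 2)]
    regions = 0
    for y in range(height + 2):
        for x in range(width + 2):
            if padded[y][x] or seen[y][x]:
                continue
            regions += 1  # found an unvisited background region
            seen[y][x] = True
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                    inside = 0 <= ny < height + 2 and 0 <= nx < width + 2
                    if inside and not padded[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return regions - 1  # discount the outer background


# Hypothetical 3-pixel-wide glyphs, for illustration only.
ZERO = [[1, 1, 1], [1, 0, 1], [1, 0, 1], [1, 1, 1]]
EIGHT = [[1, 1, 1], [1, 0, 1], [1, 1, 1], [1, 0, 1], [1, 1, 1]]
ONE = [[0, 1, 0], [0, 1, 0], [0, 1, 0], [0, 1, 0]]

features = {
    name: (count_holes(glyph), row_crossings(glyph), column_crossings(glyph))
    for name, glyph in (("0", ZERO), ("8", EIGHT), ("1", ONE))
}
```

On these samples the hole count alone separates the three glyphs (one hole for "0", two for "8", none for "1"). The crossing profiles would help distinguish digits that share a hole count.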