
A Study on Statistics-Based Post-processing for Chinese Character Recognition

【Author】 Peng Tao

【Advisors】 Guo Baolan; Tian Xuedong

【Author Information】 Hebei University, Computer Applications, 2003, Master's thesis

【Abstract】 With the rapid development of computer and network technology, large volumes of text on many kinds of physical media need to be digitized. Optical character recognition (OCR) emerged to make this more efficient and to reduce manual effort. Chinese character OCR has made great progress in recent years, and many commercial recognition systems have reached the market. However, the complex and highly variable structure of Chinese characters limits single-character recognition accuracy, and single-character recognition alone can hardly raise accuracy further. Post-processing that draws on linguistic knowledge and the contextual information in the text is therefore needed on top of single-character recognition. This thesis introduces the significance of post-processing for Chinese character recognition and surveys several post-processing methods, then applies a statistics-based method to the output of single-character recognition. Character bigram co-occurrence statistics were gathered from the full-year 2000 text of the People's Daily (《人民日报》, about 19.3 million characters), yielding the probabilistic constraints between adjacent characters in Chinese text. Following the Markov language model, these co-occurrence probabilities are used as contextual information in post-processing: the single-character recognition results are reprocessed, which improves the recognition accuracy of the whole system to a certain extent.
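
The bigram post-processing described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the add-one smoothing, the `weight` parameter, the way recognizer confidences are combined with bigram probabilities, and the toy corpus are all assumptions made for illustration. The sketch counts adjacent character pairs, turns them into first-order Markov transition probabilities, and runs a Viterbi search over the candidate characters that a single-character recognizer might return for each position.

```python
import math
from collections import Counter


def train_bigram(lines):
    """Count single characters and adjacent character pairs in a corpus."""
    unigrams, bigrams = Counter(), Counter()
    for line in lines:
        unigrams.update(line)
        bigrams.update(zip(line, line[1:]))
    return unigrams, bigrams


def log_transition(prev, cur, unigrams, bigrams, vocab_size):
    """log P(cur | prev), with add-one smoothing (an assumption of this sketch)."""
    return math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))


def postprocess(candidates, unigrams, bigrams, weight=1.0):
    """Viterbi search over per-position candidate lists.

    candidates: one list per character position, each holding
    (character, recognizer_confidence) pairs with confidence > 0.
    Returns the character sequence with the highest combined score.
    """
    vocab_size = len(unigrams)
    # best[c] = (score of the best path ending in c, that path)
    best = {c: (weight * math.log(p), c) for c, p in candidates[0]}
    for position in candidates[1:]:
        step = {}
        for cur, p in position:
            score, path = max(
                (s + log_transition(prev, cur, unigrams, bigrams, vocab_size), path)
                for prev, (s, path) in best.items()
            )
            step[cur] = (score + weight * math.log(p), path + cur)
        best = step
    return max(best.values())[1]


# Toy corpus standing in for the newspaper text (hypothetical data).
corpus = ["河北大学", "大学生", "太阳"]
unigrams, bigrams = train_bigram(corpus)

# The recognizer slightly prefers 太 over 大 for the first character.
candidates = [[("太", 0.5), ("大", 0.45)], [("学", 0.9)]]
print(postprocess(candidates, unigrams, bigrams))  # 大学
```

In this toy case the recognizer's top choice alone would give 太学, but the pair 大学 occurs twice in the corpus and 太学 never does, so the bigram score overturns the first-position choice. With a real corpus the smoothing method and the relative weight of the recognizer confidence would need tuning.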

  • 【Online publication contributor】 Hebei University
  • 【Online publication issue】 2004, No. 02
  • 【Classification code】 TP391.4
  • 【Times cited】 8
  • 【Downloads】 276